Topics: SBT, scalabha, packages, build systems

Preface

This is part 11 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

This tutorial gives an introduction to building Scala applications using SBT (the Simple Build Tool). This will be done in the context of the Scalabha package, which I have created primarily for my Introduction to Computational Linguistics class. Scalabha includes some supporting code for basic natural language processing tasks; most relevant at the moment is the code that supports the part-of-speech tagging homework for the class.

The previous tutorial showed how Scala code can be compiled with scalac and then run with scala. One problem we ended up with is that there were generated class files littering the working directory. Another thing we did not discuss is how a large system can be created in a modular way that organizes code and classes. For example, you might want to have code in different directories generate classes that can be used by one another. You may also want to incorporate classes from other libraries into your own code. The solutions we’ll discuss to address these needs and more are build systems and packages.

Note: The tutorial assumes you are using some version of Unix. If you are on Windows, you should consider using Cygwin, or you could dual boot your computer.

Note: In this tutorial, I’ll assume you are using a simple text editor to modify files. However, note that the general setup you are working with here can be used from more powerful Integrated Development Environments (IDEs) like Eclipse, IntelliJ, and NetBeans.

Setting up Scalabha

We’ll work with SBT, which is perhaps the most popular build tool for Scala.  The Scalabha toolkit mentioned earlier uses SBT (version 0.11.0), so we’ll discuss SBT in the Scalabha context.

The first thing you need to do is download Scalabha v0.1.1. Next, unzip the file, change to the directory it unpacked to, and list the directory contents.

$ unzip scalabha-0.1.1-src.zip
Archive:  scalabha-0.1.1-src.zip
<lots of output>
$ cd scalabha-0.1.1
$ ls
CHANGES.txt README      build.sbt   project
LICENSE     bin         data        src

Briefly, these contents are:

  • README: A text file describing how to install Scalabha on your machine.
  • LICENSE: A text file giving the license, which is the Apache Software License 2.0.
  • CHANGES.txt: A text file describing the modifications made for each version (not much so far).
  • build.sbt: A text file that contains instructions for SBT regarding how to build Scalabha.
  • bin: A directory that contains the scalabha script, which will be used to run applications developed within the Scalabha build system and also to run SBT itself. It also contains sbt-launch-0.11.0.jar, which is a bottled up package of SBT’s classes that will allow us to use SBT very easily. There are some other files that are Perl scripts that are relevant for a research project and aren’t important here.
  • data: A directory containing part-of-speech tagged data for English and Czech that forms the basis for the fourth homework of my Introduction to Computational Linguistics course this semester.
  • project: A directory containing a single file “plugins.sbt” which tells SBT to use the Assembly plugin. More on this later.
  • src: The most important directory of all — it contains the source code of the Scalabha system, and is where you’ll be adding some code as you work with SBT.

At this point you should read the README and get Scalabha set up on your computer, including building the system from source. In this tutorial, I will give some extra details on using SBT and code development with it, complementing and extending the brief information given in the README.

Note that I will refer to the environment variable SCALABHA_DIR below. As specified in the README, you should set this variable’s value to be where you unpacked Scalabha. For example, for me this directory is ~/devel/scalabha.

Tip: to make it so that you don’t have to set your environment variables every time you open a new shell, you can set environment variables in your ~/.profile (Mac, Cygwin) or ~/.bash_aliases (Ubuntu) files. For example, this is in my profile files on my machines.

export SCALABHA_DIR=$HOME/devel/scalabha
export PATH=$PATH:$SCALABHA_DIR/bin

SBT: The Simple Build Tool

This is not a tutorial about setting up a project to use SBT — it is simply about how to use a project that is already set up for SBT. So, if you are looking for resources about learning SBT, what you’ll mainly find are resources to help programmers configure SBT for their project. These will likely confuse you (the Simple Build Tool is not so simple any more, when it comes to configuration). Using it is straightforward, but the kind of know-how that experienced coders have with using something like SBT is what you probably won’t find much help on. Here, I intend to give the basics so that you have a better starting point for doing more with SBT.

First off, there is a bit of sleight of hand with Scalabha that could be confusing. Rather than having users install SBT themselves, I have put the jar file for SBT in the bin directory of Scalabha; then, the scalabha executable (in that same directory) can pick that up and use it to run SBT. (My students and I have set up a number of Scala/Java projects in this way, including Fogbow, Junto, Textgrounder, and Updown.) The scalabha executable has a number of execution targets (more on this later), and one of these is “build”. When you call scalabha’s build target, it invokes SBT and drops you into the SBT interface.

Do the following, in your SCALABHA_DIR.

$ scalabha build
[info] Loading project definition from /Users/jbaldrid/devel/scalabha/project
[info] Set current project to Scalabha (in build file:/Users/jbaldrid/devel/scalabha/)
>

You could have achieved the same by downloading SBT and running it according to the instructions for SBT, but this setup saves you that trouble and ensures that you get the right version of SBT. It is just worth pointing out so that you don’t think that Scalabha is SBT: SBT is entirely independent of Scalabha.

If you have had any trouble with the Scalabha setup, you can create an issue on the Scalabha Bitbucket site. That just means that I’ll get a notice that you had some problems and can hopefully help you out. And, it is possible that someone else will have had the same problem, in which case you might find your answer there. Most of the problems with this sort of setup are due to confusions about environment variables and unfamiliarity with command line tools.

Compiling with SBT

Let’s actually do something with SBT now. If you successfully got through the README, you will have already done what is next, but I’ll give some more details about what is going on.

Because you may have run some SBT actions already as part of doing the README, start out by running the “clean” action so that we’re on the same page.

> clean
[success] Total time: 0 s, completed Oct 26, 2011 10:18:08 AM

Then, run the “compile” action.

> compile
[info] Updating {file:/Users/jbaldrid/devel/scalabha/}default-86efd0...
[info] Done updating.
[info] Compiling 13 Scala sources to /Users/jbaldrid/devel/scalabha/target/classes...
[success] Total time: 9 s, completed Oct 26, 2011 10:18:19 AM

In another shell (which means another command line window), go to SCALABHA_DIR and list the contents of the directory. You’ll see that two new directories have been created, lib_managed and target. The first is where other libraries have been downloaded from the internet and placed into the Scalabha project space so that they can be easily used; don’t worry about this for the time being. The second is where the compiled class files have gone. To see some example class files, do the following.

$ ls target/classes/opennlp/scalabha/postag/
BaselineTagger$$anonfun$tag$1.class
BaselineTagger.class
EnglishTagInfo$$anonfun$zipWithTag$1$1.class
<... many more class files ...>
RuleBasedTagger$$anonfun$tag$2.class
RuleBasedTagger$$anonfun$tagWord$1.class
RuleBasedTagger.class

These were generated from the following source files.

$ ls src/main/scala/opennlp/scalabha/postag/
HmmTagger.scala PosTagger.scala

Open up PosTagger.scala in a text editor and look at it; you’ll see the class and object definitions that were the sources for the generated class files in the target/classes directory. Basically, SBT has conveniently handled the separation of source and compiled class files so that we don’t have the class files littering our work space.

How does SBT know where the source files are? Simple: it is configured to look at src/main/scala and compile every .scala file it finds under that directory. In just a bit, you’ll start adding your own Scala files and be able to compile and run them as part of the Scalabha build system.
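
To make the idea of “every .scala file under a directory” concrete, here is a small sketch that walks a directory tree and collects the files whose names end in .scala. This is only an illustration of the concept: the names SourceFinder and scalaFiles are ones I made up, and SBT’s own implementation is certainly more involved than this.

```scala
import java.io.File

object SourceFinder {

  // Recursively collect all files ending in ".scala" under dir.
  // Illustrative only; this is not how SBT itself is implemented.
  def scalaFiles(dir: File): List[File] = {
    // listFiles returns null if dir is not a readable directory
    val entries = Option(dir.listFiles()).map(_.toList).getOrElse(Nil)
    entries.flatMap { entry =>
      if (entry.isDirectory) scalaFiles(entry)
      else if (entry.getName.endsWith(".scala")) List(entry)
      else Nil
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumes the first argument is a directory such as src/main/scala
    scalaFiles(new File(args(0))).foreach(file => println(file.getPath))
  }

}
```

Files that are not Scala sources are simply skipped, which is the same reason you can keep notes or data files next to your code without confusing the build.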

Next, at the SBT prompt, invoke the “package” action.

> package
[info] Updating {file:/Users/jbaldrid/devel/scalabha/}default-86efd0...
[info] Done updating.
[info] Packaging /Users/jbaldrid/devel/scalabha/target/scalabha-0.1.1.jar ...
[info] Done packaging.
[success] Total time: 0 s, completed Oct 26, 2011 10:19:02 AM

In the shell prompt that we used to list files previously, list the contents of the target directory.

$ ls target/
cache              classes            scalabha-0.1.1.jar streams

You have just created scalabha-0.1.1.jar, a bottled up version of the Scalabha code that others could use in their own libraries. The extension “jar” stands for Java Archive, and it is basically just a zipped up collection of a bunch of class files.
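
Since a jar is basically a zip file, you can read one with the standard zip classes that come with Java. The following is a minimal sketch under that assumption; JarContents and classEntries are names I made up for illustration, and nothing here is part of Scalabha or SBT.

```scala
import java.util.zip.ZipFile
import scala.collection.mutable.ListBuffer

object JarContents {

  // List the names of the .class entries in a jar by reading it as a zip file.
  def classEntries(jarPath: String): List[String] = {
    val zip = new ZipFile(jarPath)
    try {
      val names = ListBuffer[String]()
      val entries = zip.entries()
      while (entries.hasMoreElements) {
        val name = entries.nextElement().getName
        if (name.endsWith(".class")) names += name
      }
      names.toList
    } finally {
      zip.close() // always release the file, even if reading fails
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumes the first argument is a path such as target/scalabha-0.1.1.jar
    classEntries(args(0)).foreach(println)
  }

}
```

The entry names are paths like the ones you saw under target/classes, which is how the package structure is preserved inside the jar.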

Scalabha itself uses a number of supporting libraries produced by others. To see the jars that are used as supporting libraries by Scalabha, do the following.

$ ls lib_managed/jars/*/*/*.jar
lib_managed/jars/jline/jline/jline-0.9.94.jar
lib_managed/jars/junit/junit/junit-3.8.1.jar
lib_managed/jars/org.apache.commons/commons-lang3/commons-lang3-3.0.1.jar
lib_managed/jars/org.clapper/argot_2.9.1/argot_2.9.1-0.3.5.jar
lib_managed/jars/org.clapper/grizzled-scala_2.9.1/grizzled-scala_2.9.1-1.0.8.jar
lib_managed/jars/org.scalatest/scalatest_2.9.0/scalatest_2.9.0-1.6.1.jar

Of course, you may still be wondering what it means to “use a library” in your code. More on this after we talk about packages and actually start doing some code ourselves.

Packages

Projects with a lot of code are generally organized into a package that has a set of sub-packages for parts of the code base that work closely together. At the very high level, a package is simply a way to ensure that we have unique fully qualified names for classes. For example, there is a class called Range in the Apache Commons Lang library and in the core Scala library. If you want to use both of these classes in the same piece of code, there is an obvious problem of a name conflict. Fortunately, they are contained within packages that allow us to refer to them uniquely.

  • Range in the Apache Commons Lang library is org.apache.commons.lang3.Range
  • Range in Scala is scala.collection.immutable.Range

So, when we do need to use them together, we are still able to do so without conflict. You’ve actually already seen some package names before, for example with java.lang.String and the distinction between scala.collection.mutable.Map and scala.collection.immutable.Map.
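
As a minimal sketch of using two same-named classes together, the following method uses both Map classes from the standard library, telling them apart by their fully qualified names. The names WordCounts and count are just made up for this illustration.

```scala
object WordCounts {

  // Count words with a mutable Map, then hand back an immutable Map.
  // The package prefix makes it clear which Map is meant in each place.
  def count(words: List[String]): scala.collection.immutable.Map[String, Int] = {
    val tally = scala.collection.mutable.Map[String, Int]()
    for (word <- words) tally(word) = tally.getOrElse(word, 0) + 1
    tally.toMap
  }

}
```

Writing out the full names is verbose, but it shows that the two classes never actually collide: the package is part of the name.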

To see the packages and classes in Scalabha, run the “doc” action in SBT.

> doc
[info] Generating API documentation for main sources...
model contains 35 documentable templates
[info] API documentation generation successful.
[success] Total time: 7 s, completed Oct 26, 2011 10:22:23 AM

Now, point your browser to the file target/api/index.html. Note: this means doing “open file” and then going to your SCALABHA_DIR and then to target, then to api, and then selecting index.html. You can then browse the packages and classes in Scalabha. For example, look at HmmTagger, which is in the package opennlp.scalabha.postag, and you’ll see some of the fields and functions that are made available by that class.

But, you may still be wondering: how do I use these packages and classes in my code anyway? We do so via import statements. We’ll explore this by creating our own source code and compiling it.

Creating and compiling new code in SBT

First, we’ll begin by just doing a simple hello world application that is done in the context of Scalabha and uses a package name. Get set up for this by doing the following set of commands.

$ cd $SCALABHA_DIR
$ cd src/main/scala/opennlp/
$ mkdir bcomposes

Next, using a text editor, create the file Hello.scala in the src/main/scala/opennlp/bcomposes directory with the following contents.

package opennlp.bcomposes

object Hello {
  def main (args: Array[String]) = println("Hello, world!")
}

This is just like the hello world object from the previous tutorial, but now it has the additional package specification that indicates that its fully qualified name is opennlp.bcomposes.Hello.

Because the source code for Hello.scala is in a sub-directory of the src/main/scala directory, we can now compile this file using SBT. Make sure to save Hello.scala, and then go back to your SBT prompt and type “compile”.

> compile
[info] Compiling 1 Scala source to /Users/jbaldrid/devel/scalabha/target/classes...
[success] Total time: 1 s, completed Oct 26, 2011 10:35:15 AM

Notice that it compiled just one Scala source: SBT has already compiled the other source files in Scalabha, so it only had to compile the new one that you just saved.

Having successfully created and compiled the opennlp.bcomposes.Hello object, we can now run it. The scalabha executable provides a “run” target that allows you to run any of the code you’ve produced in the Scalabha build setup. In your shell, type the following.

$ scalabha run opennlp.bcomposes.Hello
Hello, world!

There is actually a bunch of stuff going on under the hood that ensures that your new class is included in the CLASSPATH and can be used in this manner (see bin/scalabha for details). This will simplify things for you considerably. To make a long story short, getting the CLASSPATH appropriately set is one of the main points of confusion for new developers; this way you can keep on moving without having to worry about what is essentially a plumbing problem.

Now, let’s say you want to change the definition of the Hello object to also print out an additional message that is supplied on the command line. Modify the main method to look like this.

def main (args: Array[String]) {
  println("Hello, world!")
  println(args(0))
}

Now save it, and try running it.

$ scalabha run opennlp.bcomposes.Hello Goodbye
Hello, world!

Oops, it didn’t work?! I’ve just forced you directly into a common point of confusion for students who are switching from scripting to compiling: you must recompile your code before your changes take effect. So, invoke compile in SBT, and then try that command again.

$ scalabha run opennlp.bcomposes.Hello Goodbye
Hello, world!
Goodbye
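
One thing to be aware of: with this definition, running Hello without any argument will fail, because args(0) does not exist when args is empty. Here is a sketch of a slightly more defensive variant that only prints the extra message when one was supplied. The names HelloSafe and messages are mine, and I’ve left off the package line to keep the sketch short; add it at the top as before if you save this in the bcomposes directory.

```scala
object HelloSafe {

  // The lines to print: the greeting, plus the first argument if there is one.
  def messages(args: Array[String]): List[String] =
    if (args.length > 0) List("Hello, world!", args(0))
    else List("Hello, world!")

  def main(args: Array[String]): Unit = messages(args).foreach(println)

}
```

Keeping the decision about what to print in its own method, separate from main, also makes it easy to try out in the REPL.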

To see what happens when you produce a syntax error in your Scala code, go back to Hello.scala and change the first print statement in the main method so that it is missing the last quote:

println("Hello, world!)

Now go back to SBT and compile again to see the love letter you get from the Scala compiler.

[info] Compiling 1 Scala source to /Users/jbaldrid/devel/scalabha/target/classes...
[error] /Users/jbaldrid/devel/scalabha/src/main/scala/opennlp/bcomposes/Hello.scala:5: unclosed string literal
[error]     println("Hello, world!)
[error]             ^
[error] /Users/jbaldrid/devel/scalabha/src/main/scala/opennlp/bcomposes/Hello.scala:7: ')' expected but '}' found.
[error]   }
[error]   ^
[error] two errors found
[error] {file:/Users/jbaldrid/devel/scalabha/}default-86efd0/compile:compile: Compilation failed
[error] Total time: 0 s, completed Oct 26, 2011 11:02:07 AM

The compile attempt failed, and you must go back and fix it. But don’t do that yet. There’s a handy aspect of SBT in this write-save-compile loop that saves you time and effort: SBT allows triggered execution of actions, which means that SBT can automatically perform an action if there is a change to the files it cares about. The compile action cares about the source code, so it can monitor changes in the file system and automatically recompile any time a file is saved. To do this, you simply add ~ in front of the action.

Before fixing the error, type ~compile into SBT. You’ll see the same error message as before, but don’t worry about that. The last line of output from SBT will say:

1. Waiting for source changes... (press enter to interrupt)

Now go to Hello.scala again, add the quote back in, and save the file. This triggers the compile action in SBT, so you’ll see it automatically compile, with a success message.

[info] Compiling 1 Scala source to /Users/jbaldrid/devel/scalabha/target/classes...
[success] Total time: 0 s, completed Oct 26, 2011 11:02:49 AM
2. Waiting for source changes... (press enter to interrupt)

This is a nice way to see if your code is compiling as you work on it, with very little effort. Every time you save the file, it will let you know if there are problems. And, you’ll also be able to use the scalabha run target and know that you are using the latest compiled version when you do so.

As you develop your code in this way, you can invoke the “doc” action in SBT, then reload the index.html page in your browser, and it will show you the updated documentation for the things you’ve created. Try it now and look at the opennlp.bcomposes package that you’ve now created.

Creating code that uses existing packages

Now we can come back to using code from existing packages. In the past (if you’ve gone through all of these tutorials), you’ve seen statements like import scala.io.Source. That came from the standard Scala library, so it is always available to any Scala program. However, you can also use classes developed by others in a similar manner, provided your CLASSPATH is set up such that they are available. That is exactly what SBT does for you: all of the classes that are defined in the src/main/scala sub-directories are ready for your use.

As an example, save the following code as src/main/scala/opennlp/bcomposes/TreeTest.scala. It constructs a standard phrase structure tree for the sentence “I like coffee.”

package opennlp.bcomposes

import opennlp.scalabha.model.{Node,Value}

object TreeTest {

  def main (args: Array[String]) {
    val leaf1 = Value("I")
    val leaf2 = Value("like")
    val leaf3 = Value("coffee")
    val subjNpNode = Node("NP", List(leaf1))
    val verbNode = Node("V", List(leaf2))
    val objNpNode = Node("NP", List(leaf3))
    val vpNode = Node("VP", List(verbNode, objNpNode))
    val sentenceNode = Node("S", List(subjNpNode, vpNode))

    println("Printing the full tree:\n" + sentenceNode)
    println("\nPrinting the children of the VP node:\n" + vpNode.children)

    println("\nPrinting the yield of the full tree:\n" + sentenceNode.getTokens.mkString(" "))
    println("\nPrinting the yield of the VP node:\n" + vpNode.getTokens.mkString(" "))
  }

}

There are a few things to note here. The import statement at the top is what tells Scala the fully qualified package names for the classes Node and Value. You could have equivalently written it less concisely as follows.

import opennlp.scalabha.model.Node
import opennlp.scalabha.model.Value

Or, you could have left out the import statement and written the fully qualified names everywhere, e.g.:

val leaf1 = opennlp.scalabha.model.Value("I")

Second, Node and Value are case classes. We’ll discuss this more later, but for now, all you need to know is that to create an object of the Node or Value classes, it isn’t necessary to use the “new” keyword.

Third, the print statements are using the Scalabha API (Application Programming Interface) to do useful things with the objects, such as printing out the tree they describe, printing the yield of the nodes (the words that they cover), and so on. The scaladoc you looked at before for Scalabha shows you these functions, so go have a look if you haven’t already.
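
To get a feel for what case classes give you, here is a small self-contained sketch. The Leaf and Branch classes below are hypothetical stand-ins that I made up for this example; they are not the actual definitions of Value and Node in Scalabha, so consult the scaladoc for the real API.

```scala
// Hypothetical stand-ins for illustration; not Scalabha's Node and Value.
sealed trait Tree
case class Leaf(word: String) extends Tree
case class Branch(label: String, children: List[Tree]) extends Tree

object TreeSketch {

  // The yield of a tree: the words at its leaves, from left to right.
  def tokens(tree: Tree): List[String] = tree match {
    case Leaf(word) => List(word)
    case Branch(_, children) => children.flatMap(tokens)
  }

  def main(args: Array[String]): Unit = {
    // No "new" keyword is needed to construct case class instances.
    val verb = Branch("V", List(Leaf("like")))
    val obj = Branch("NP", List(Leaf("coffee")))
    val vp = Branch("VP", List(verb, obj))
    println(vp)
    println(tokens(vp).mkString(" "))
  }

}
```

Besides construction without “new”, case classes come with a readable printed form and can be taken apart with pattern matching, which is what the tokens method relies on.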

Note that if you left triggered compilation on, SBT will have automatically compiled TreeTest.scala. Otherwise, make sure to compile it yourself. Then, run it.

$ scalabha run opennlp.bcomposes.TreeTest
Printing the full tree:
Node(S,List(Node(NP,List(Value(I))), Node(VP,List(Node(V,List(Value(like))), Node(NP,List(Value(coffee)))))))

Printing the children of the VP node:
List(Node(V,List(Value(like))), Node(NP,List(Value(coffee))))

Printing the yield of the full tree:
I like coffee

Printing the yield of the VP node:
like coffee

Make and use your own package

By importing the classes you need in this manner, you can build on existing code rather than writing everything yourself. Any class in Scalabha or in the libraries that are included with it will be available to you, as will any classes you define yourself. As an example, do the following.

$ cd $SCALABHA_DIR/src/main/scala/opennlp/bcomposes
$ mkdir person
$ mkdir music

Now save the Person class from the previous tutorial as Person.scala in the person directory. Here’s the code again (note the addition of the package statement).

package opennlp.bcomposes.person

class Person (
  val firstName: String,
  val lastName: String,
  val age: Int,
  val occupation: String
) {

  def fullName: String = firstName + " " + lastName

  def greet (formal: Boolean): String = {
    if (formal)
      "Hello, my name is " + fullName + ". I'm a " + occupation + "."
    else
      "Hi, I'm " + firstName + "!"
  }

}

Now save the following as RadioheadGreeting.scala in the music directory.

package opennlp.bcomposes.music

import opennlp.bcomposes.person.Person

object RadioheadGreeting {

  def main (args: Array[String]) {
    val thomYorke = new Person("Thom", "Yorke", 43, "musician")
    val johnnyGreenwood = new Person("Johnny", "Greenwood", 39, "musician")
    val colinGreenwood = new Person("Colin", "Greenwood", 41, "musician")
    val edObrien = new Person("Ed", "O'Brien", 42, "musician")
    val philSelway = new Person("Phil", "Selway", 44, "musician")
    val radiohead = List(thomYorke, johnnyGreenwood, colinGreenwood, edObrien, philSelway)
    radiohead.foreach(bandmember => println(bandmember.greet(false)))
  }

}

When we did the compilation tutorial previously, Person.scala and RadioheadGreeting.scala were in the same directory, which allowed the latter to know about the Person class. Now that they are in separate packages, the Person class must be explicitly imported; once you’ve done so, you can code with Person objects just as you did before.

Finally, to run it, we now must specify the fully qualified package name for RadioheadGreeting.

$ scalabha run opennlp.bcomposes.music.RadioheadGreeting
Hi, I'm Thom!
Hi, I'm Johnny!
Hi, I'm Colin!
Hi, I'm Ed!
Hi, I'm Phil!

A note on package names and their relation to directories

Package names are made unique by certain conventions that generally ensure you won’t get clashes. For example, we are using opennlp.scalabha and opennlp.bcomposes, which I happen to know are unique. Quite often these names will include full internet domains, in reverse, like org.apache.commons and com.cloudera.crunch. By convention, we put the source files that are in packages (and subpackages) in directory structures that reflect the names. So, for example, opennlp.bcomposes.music.RadioheadGreeting is in the directory src/main/scala/opennlp/bcomposes/music. However, it is worth noting that this is not a hard constraint with Scala (as it is with Java).

There is a great deal more to using a build system, but this is where I must end this discussion, hoping it is enough to get the core concepts across and make it possible for my students to do the homework on part-of-speech tagging and make use of the opennlp.scalabha.postag package!

Copyright 2011 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

 

Topics: scripting, compiling, main methods, return values of functions

Preface

This is part 10 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

The tutorials up to this point have been based on working with the Scala REPL or running basic scripts from the command line. The latter is called “scripting” and usually is done for fairly simple, self-contained coding tasks. For more involved tasks that require a number of different modules and accessing libraries produced by others, it is necessary to work with a build system that brings together your code and others’ code, and allows you to compile it, test it, and package it so that you can use it as an application.

This tutorial takes you from running Scala scripts to compiling Scala programs to create byte code that can be shared by different applications. This will act as a bridge to set you up for the next step of using a build system. Along the way, some points will be made about objects, extending on some of the ideas from the previous tutorial about object-oriented programming. At a high level, the relevance of objects to a larger, modularized code base should be pretty clear: objects encapsulate data and functions that can be used by other objects, and we need to be able to organize them so that objects know how to find other objects and class definitions. Build systems, which we’ll look at in the next tutorial, will make this straightforward.

Running Scala scripts

In the beginning, you started with the REPL.

scala> println("Hello, World!")
Hello, World!

Of course, the REPL is just a (very useful) playground for trying out snippets of Scala code, not for doing real work. So, you saw that you could put code like println("Hello, World!") into a file called Hello.scala and run it from the command line.

$ scala Hello.scala
Hello, World!

The homeworks and tutorials done so far have worked in this way, though they are a bit more complex. We can even include class definitions and objects created from a class. For example, using the Person class from the previous tutorial, we can put all the code into a file called People.scala (btw, this name doesn’t matter — could as well be Blurglecruncheon.scala).

class Person (
  val firstName: String,
  val lastName: String,
  val age: Int,
  val occupation: String
) {

  def fullName: String = firstName + " " + lastName

  def greet (formal: Boolean): String = {
    if (formal)
      "Hello, my name is " + fullName + ". I'm a " + occupation + "."
    else
      "Hi, I'm " + firstName + "!"
  }

}

val johnSmith = new Person("John", "Smith", 37, "linguist")
val janeDoe = new Person("Jane", "Doe", 34, "computer scientist")
val johnDoe = new Person("John", "Doe", 43, "philosopher")
val johnBrown = new Person("John", "Brown", 28, "mathematician")

val people = List(johnSmith, janeDoe, johnDoe, johnBrown)
people.foreach(person => println(person.greet(true)))

This can now be run from the command line, producing the expected result.

$ scala People.scala
Hello, my name is John Smith. I'm a linguist.
Hello, my name is Jane Doe. I'm a computer scientist.
Hello, my name is John Doe. I'm a philosopher.
Hello, my name is John Brown. I'm a mathematician.

However, suppose you wanted to use the Person class from a different application (e.g. that is defined in a different file). You might think you could save the following in the file Radiohead.scala, and then run it with Scala.

val thomYorke = new Person("Thom", "Yorke", 43, "musician")
val johnnyGreenwood = new Person("Johnny", "Greenwood", 39, "musician")
val colinGreenwood = new Person("Colin", "Greenwood", 41, "musician")
val edObrien = new Person("Ed", "O'Brien", 42, "musician")
val philSelway = new Person("Phil", "Selway", 44, "musician")
val radiohead = List(thomYorke, johnnyGreenwood, colinGreenwood, edObrien, philSelway)
radiohead.foreach(bandmember => println(bandmember.greet(false)))

However, if you do “scala Radiohead.scala” you’ll see five errors, each one complaining that the type Person wasn’t found. How could Radiohead.scala know about the Person class and where to find its definition? I’m not aware of a way to do this with scripting-style Scala programming, and even though I suspect there may be a way to do something this simple, I don’t even care to know it. Let’s just get straight to compiling.

Compiling

The usual thing we do with Scala is to compile our programs to byte code. We won’t go into the details of that, but it basically means that Scala turns the text of a Scala program into a compiled set of instructions that are run by the Java Virtual Machine (JVM), not directly by your operating system. (Scala compiles to Java byte code, which is one reason it is pretty straightforward to use Java code when coding in Scala.)

So, what does compilation look like? We need to start by changing the code we did above a bit. Make a directory that has nothing in it, say /tmp/tutorial. Then save the following as PersonApp.scala in that directory.

class Person (
  val firstName: String,
  val lastName: String,
  val age: Int,
  val occupation: String
) {

  def fullName: String = firstName + " " + lastName

  def greet (formal: Boolean): String = {
    if (formal)
      "Hello, my name is " + fullName + ". I'm a " + occupation + "."
    else
      "Hi, I'm " + firstName + "!"
  }

}

object PersonApp {

  def main (args: Array[String]) {
    val johnSmith = new Person("John", "Smith", 37, "linguist")
    val janeDoe = new Person("Jane", "Doe", 34, "computer scientist")
    val johnDoe = new Person("John", "Doe", 43, "philosopher")
    val johnBrown = new Person("John", "Brown", 28, "mathematician")

    val people = List(johnSmith, janeDoe, johnDoe, johnBrown)
    people.foreach(person => println(person.greet(true)))
  }

}

Notice that the code looks pretty similar to the script above, but now we have a PersonApp object with a main method. The main method contains all the stuff that the original script had after the Person definition. Notice also that there is an args argument to the main method, which should look familiar now. What you are seeing is that a Scala script is basically just a simplified view of an object with a main method. Such scripts use the convention that the Array[String] provided to the method is called args.
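
As a small sketch of this correspondence (EchoApp is a name I made up for illustration), the object below does in a main method what a script could do at the top level with args. Because main is just an ordinary method, other code can also call it directly with an array of strings.

```scala
object EchoApp {

  // Same shape as a script body: args holds the command-line words.
  def main(args: Array[String]): Unit = {
    println("Received " + args.length + " argument(s)")
    args.foreach(arg => println("  " + arg))
  }

}
```

In a script, the two lines inside main would sit at the top of the file with nothing wrapped around them, and args would be supplied for you.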

Okay, so now consider what happens if you run “scala PersonApp.scala” — nothing at all. That’s because there is no executable code available outside of the object and class definitions. Instead, we need to compile the code and then run the main method of specific objects. The next step is to run scalac (N.B. “scalac” with a “c”, not “scala”) on PersonApp.scala. The name scalac is short for Scala compiler. Do the following steps in the /tmp/tutorial directory.

$ scalac PersonApp.scala
$ ls
Person.class                    PersonApp.class
PersonApp$$anonfun$main$1.class PersonApp.scala
PersonApp$.class

Notice that a number of *.class files have been generated. These are byte code files that the scala application is able to run. A nice thing here is that all the compilation is done up front: when in the past you ran “scala” on your programs (scripts), it had to first compile the instructions and then run the program. Now we are separating these steps into a compilation phase and a running phase.

Having generated the class files, we can run any object that has a main method, like PersonApp.

$ scala PersonApp
Hello, my name is John Smith. I'm a linguist.
Hello, my name is Jane Doe. I'm a computer scientist.
Hello, my name is John Doe. I'm a philosopher.
Hello, my name is John Brown. I'm a mathematician.

Try running “scala Person” to see the error message it gives you.

Next, move the Radiohead.scala script that you saved earlier into this directory and run it.

$ scala Radiohead.scala
Hi, I'm Thom!
Hi, I'm Johnny!
Hi, I'm Colin!
Hi, I'm Ed!
Hi, I'm Phil!

This is the same script, but now it is in a directory that contains the Person.class file, which tells Scala everything that Radiohead.scala needs to construct objects of the Person class. Scala makes available any class file that is defined in the CLASSPATH, an environment variable that by default includes the current working directory.

Despite this success, we’re going away from script land with this post, so change the contents of Radiohead.scala to be the following.

object RadioheadGreeting {

  def main (args: Array[String]) {
    val thomYorke = new Person("Thom", "Yorke", 43, "musician")
    val johnnyGreenwood = new Person("Johnny", "Greenwood", 39, "musician")
    val colinGreenwood = new Person("Colin", "Greenwood", 41, "musician")
    val edObrien = new Person("Ed", "O'Brien", 42, "musician")
    val philSelway = new Person("Phil", "Selway", 44, "musician")
    val radiohead = List(thomYorke, johnnyGreenwood, colinGreenwood, edObrien, philSelway)
    radiohead.foreach(bandmember => println(bandmember.greet(false)))
  }

}

Then run scalac on all of the *.scala files in the directory. There are now more class files, corresponding to the RadioheadGreeting object we defined.

$ scalac *.scala
$ ls
Person.class                            Radiohead.scala
PersonApp$$anonfun$main$1.class         RadioheadGreeting$$anonfun$main$1.class
PersonApp$.class                        RadioheadGreeting$.class
PersonApp.class                         RadioheadGreeting.class
PersonApp.scala

You can now run “scala RadioheadGreeting” to get the greeting from the band members. Notice that the file in which RadioheadGreeting was saved is called Radiohead.scala and that no class files called Radiohead.class, etc. were generated. Again, the file could have been named something entirely different, like Turlingdrome.scala. (Embrace your inner Vogon.)

Multiple objects in the same file

There is no problem having multiple objects with main methods in the same file. When you compile the file with scalac, each object generates its own set of class files, and you call scala with the name of whichever object contains the main method you want to run. As an example, save the following as Greetings.scala.

object Hello {
  def main (args: Array[String]) {
    println("Hello, world!")
  }
}

object Goodbye {
  def main (args: Array[String]) {
    println("Goodbye, world!")
  }
}

object SayIt {
  def main (args: Array[String]) {
    args.foreach(println)
  }
}

Next compile the file and then you can run any of the generated class files (since they all have main methods).

$ scalac Greetings.scala
$ scala Hello
Hello, world!
$ scala Goodbye
Goodbye, world!
$ scala Goodbye many useless arguments
Goodbye, world!
$ scala SayIt "Oh freddled gruntbuggly" "thy micturations are to me" "As plurdled gabbleblotchits on a lurgid bee."
Oh freddled gruntbuggly
thy micturations are to me
As plurdled gabbleblotchits on a lurgid bee.

In case you missed it earlier, the args array is where the command line arguments go and you can thus make use of them (or not, as in the case of the Hello and Goodbye objects).
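As a small sketch of actually using the command line arguments, here is an object that greets each name it is given. The names GreetArgs and greetings are made up for this illustration and are not part of the tutorial's files. The main method is written with ": Unit =", which behaves like the brace-only form used above and should also compile on newer Scala versions.

```scala
object GreetArgs {

  // Builds one greeting per name; this part returns a value and prints nothing.
  def greetings(names: Array[String]): List[String] =
    names.toList.map(name => "Hello, " + name + "!")

  // ": Unit =" means main returns no value, just like the brace-only form above.
  def main(args: Array[String]): Unit = {
    if (args.isEmpty)
      println("Usage: scala GreetArgs name1 name2 ...")
    else
      greetings(args).foreach(println)
  }

}
```

If you save this in its own file and compile it with scalac, then “scala GreetArgs Ford Arthur” should print one greeting for Ford and one for Arthur.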

Functions with return values versus those without

Some functions return a value while others do not. As a simple example, consider the following pair of functions.

scala> def plusOne (x: Int) = x+1
plusOne: (x: Int)Int

scala> def printPlusOne (x: Int) = println(x+1)
printPlusOne: (x: Int)Unit

The first takes an Int argument and returns an Int, which is a value. The other takes an Int and returns Unit, which is to say it doesn’t return a value. Notice the difference in behavior between the two following uses of the functions.

scala> val foo = plusOne(2)
foo: Int = 3

scala> val bar = printPlusOne(2)
3
bar: Unit = ()

Scala uses a slightly subtle distinction in function definitions that can distinguish functions that return values versus those that return Unit (no value): If you don’t use an equals sign in the definition, it means that the function returns Unit.

scala> def plusOneNoEquals (x: Int) { x+1 }
plusOneNoEquals: (x: Int)Unit

scala> def printPlusOneNoEquals (x: Int) { println(x+1) }
printPlusOneNoEquals: (x: Int)Unit

Notice that the above definition of plusOneNoEquals returns Unit, even though it looks almost identical to plusOne defined earlier. Check it out.

scala> val foo = plusOneNoEquals(2)
foo: Unit = ()

Now look back at the main methods given earlier. No equals. Yep, they don’t have a return value. They are the entry point into your code, and any effects of running the code must be output to the console (e.g. with println or via a GUI) or written to the file system (or the internet somewhere). The outputs of such functions (ones which do not return a value) are called side-effects. You need them for the main methods. However, in many styles of programming, a great deal of work is done with side-effects. I’ve been trying to gently lead the readers of this tutorial to adopt a more functional approach that tries to avoid them. I’ve found it a more effective style myself in my own coding, so I’m hoping it will serve you all better to start from that point. (Note that Scala supports many styles of programming, which is nice because you have choice and can go with what you find most suitable.)
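To make the functional approach a bit more concrete, here is a minimal sketch in which the functions that do the work return values and the only side-effect is a single println in the main method. The names AgeReport, averageAge and report are invented for this example. It also writes the result types out explicitly, including ": Unit =" on main; as far as I know, newer Scala versions discourage the no-equals form, so spelling out Unit is a safe habit.

```scala
object AgeReport {

  // No side-effects: the result depends only on the argument.
  def averageAge(ages: List[Int]): Double = ages.sum / ages.length.toDouble

  // No side-effects either: this builds the text of the report but does not print it.
  def report(ages: List[Int]): String = "Average age: " + averageAge(ages)

  // The one side-effect lives at the entry point.
  def main(args: Array[String]): Unit = {
    println(report(List(37, 34, 43, 28)))
  }

}
```

Because averageAge and report return values, you can try them out in the REPL or reuse them in other code without anything being printed along the way.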

Cleaning up

You may have noticed that the directory you are working in as you run scalac on your scala files becomes quite littered with class files. For example, here’s the state of the directory used in this tutorial after compiling all files.

$ ls
Goodbye$.class                          PersonApp.scala
Goodbye.class                           Radiohead.scala
Greetings.scala                         RadioheadGreeting$$anonfun$main$1.class
Hello$.class                            RadioheadGreeting$.class
Hello.class                             RadioheadGreeting.class
Person.class                            SayIt$$anonfun$main$1.class
PersonApp$$anonfun$main$1.class         SayIt$.class
PersonApp$.class                        SayIt.class
PersonApp.class

A mess, right? Generally, one would rarely develop a Scala application by compiling it directly in this way. Instead a build system is used to manage the compilation process, organize the files, and allow one to easily access software libraries created by other developers. The next tutorial will cover this, using SBT (the Simple Build Tool).

Copyright 2011 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

Topics: objects, classes, inheritance, traits, Lists with multiple related types, apply

Preface

This is part 9 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

This tutorial is about object-oriented programming with Scala. Most of what we’ve seen so far has been programming with functions and using basic types, like Int, Double, and String, and with predefined types like List and Map. As it turns out, these are all classes, or types of Scala data structures that allow one to create objects, or instances of the type. This tutorial will not give a broad introduction to object-oriented programming, but it will give some practical examples of classes and objects and how to use them. I apologize in advance for some sloppiness in the presentation of object-oriented concepts; the intent is to get across the ideas for beginners mainly through intuitive examples without being mired in lots of technical details. See the Wikipedia page on object-oriented programming for more detail.

Note that the definitions of objects and classes in this tutorial are most easily viewed as plain text, out of the REPL. So, I’ll put a piece of code into the text, and you should add it to your own REPL (by simply cutting and pasting) in order to be able to follow along.

Objects

At its core, an object can be thought of as a structure that encapsulates some data and functions. Let’s start with an example of an object representing a person and some of their possible attributes.

object JohnSmith {
  val firstName = "John"
  val lastName = "Smith"
  val age = 37
  val occupation = "linguist"

  def fullName: String = firstName + " " + lastName

  def greet (formal: Boolean): String = {
    if (formal)
      "Hello, my name is " + fullName + ". I'm a " + occupation + "."
    else
      "Hi, I'm " + firstName + "!"
  }

}

If you put this into the Scala REPL, you’ll be able to access the fields (firstName, lastName, age, and occupation) and the functions (fullName and greet).

scala> JohnSmith.firstName
res0: java.lang.String = John

scala> JohnSmith.fullName
res1: String = John Smith

scala> JohnSmith.greet(true)
res2: String = Hello, my name is John Smith. I'm a linguist.

scala> JohnSmith.greet(false)
res3: String = Hi, I'm John!

So, at its most basic level, an object is just that: a collection of values and functions (also often called methods). You can access any of those values or functions by giving the name of the object followed by a period followed by the value or function you want to use. This can be useful for organizing such collections, but it also leads to many more possibilities, as we’ll see.

We might of course be interested in having the information about another person encapsulated in this way. We could do this by mimicking the definition for John Smith.

object JaneDoe {
  val firstName = "Jane"
  val lastName = "Doe"
  val age = 34
  val occupation = "computer scientist"

  def fullName: String = firstName + " " + lastName

  def greet (formal: Boolean): String = {
    if (formal)
      "Hello, my name is " + fullName + ". I'm a " + occupation + "."
    else
      "Hi, I'm " + firstName + "!"
  }

}

After adding the above code to the REPL, now Jane Doe can greet us.

scala> JaneDoe.greet(true)
res4: String = Hello, my name is Jane Doe. I'm a computer scientist.

scala> JaneDoe.greet(false)
res5: String = Hi, I'm Jane!

Of course, I created the JaneDoe object by doing a copy-and-paste and then replacing the fields with Jane Doe’s information. This leads to a lot of wasted effort: the fields are the same, but the values are different, and the functions are completely identical. If you want to change something about the way greetings are made, you’d have to update it across all of the objects.

More importantly, these two objects are completely distinct from one another: one cannot put them in a list and map a function over that list. Consider the following failed attempt.

scala> val people = List(JohnSmith, JaneDoe)
people: List[ScalaObject] = List(JohnSmith$@698fcb66, JaneDoe$@5f72cbae)

scala> people.map(person => person.firstName)
<console>:11: error: value firstName is not a member of ScalaObject
people.map(person => person.firstName)
                                          ^

The only thing that Scala knows about JohnSmith and JaneDoe is that they are ScalaObjects. That means that a list of such objects can basically just contain them and allow you to move them around as a group. So, something more is needed to make these collections more useful and more general.

Classes

With the list above, what we’d like to have is a List[Person], where Person is a type that has known fields and functions. We can accomplish this by defining a Person class and then defining John and Jane as members of that class. This also reduces the cut-and-paste duplication problem noted earlier. Here’s what it looks like.

class Person (
  val firstName: String,
  val lastName: String,
  val age: Int,
  val occupation: String
) {

  def fullName: String = firstName + " " + lastName

  def greet (formal: Boolean): String = {
    if (formal)
      "Hello, my name is " + fullName + ". I'm a " + occupation + "."
    else
      "Hi, I'm " + firstName + "!"
  }

}

The class keyword indicates that this is a class definition and Person is the name of the class. The next part of the definition is a set of parameters to the class that allow us to construct objects that are instances of the class — in other words, they are placeholders that allow us to use the Person class as a factory for creating Person objects. We do this by using the new keyword, giving the name of the class and supplying the values for each of the parameters. For example, here’s how we can create John Smith now.

scala> val johnSmith = new Person("John", "Smith", 37, "linguist")
johnSmith: Person = Person@1979d4fb

Just as we could with the one-off standalone JohnSmith object previously, we can now access the fields and functions.

scala> johnSmith.age
res8: Int = 37

scala> johnSmith.greet(true)
res9: String = Hello, my name is John Smith. I'm a linguist.

Defining other people is now easy, and doesn’t require any cutting-and-pasting.

scala> val janeDoe = new Person("Jane", "Doe", 34, "computer scientist")
janeDoe: Person = Person@7ff5376c

scala> val johnDoe = new Person("John", "Doe", 43, "philosopher")
johnDoe: Person = Person@6544c984

scala> val johnBrown = new Person("John", "Brown", 28, "mathematician")
johnBrown: Person = Person@4076a247

These Person objects can now be put into a list together, giving us a List[Person] that allows mapping to retrieve specific values, like first names and ages, and performing computations like calculating the average age of the individuals in the list.

scala> val people = List(johnSmith, janeDoe, johnDoe, johnBrown)
people: List[Person] = List(Person@1979d4fb, Person@7ff5376c, Person@6544c984, Person@4076a247)

scala> people.map(person => person.firstName)
res10: List[String] = List(John, Jane, John, John)

scala> people.map(person => person.age)
res11: List[Int] = List(37, 34, 43, 28)

scala> people.map(person => person.age).sum/people.length.toDouble
res12: Double = 35.5

We can sort them according to age.

scala> val ageSortedPeople = people.sortBy(_.age)
ageSortedPeople: List[Person] = List(Person@4076a247, Person@7ff5376c, Person@1979d4fb, Person@6544c984)

scala> ageSortedPeople.map(person => person.fullName + ":" + person.age)
res13: List[java.lang.String] = List(John Brown:28, Jane Doe:34, John Smith:37, John Doe:43)

We can also group people by first name, last name, etc.

scala> people.groupBy(person => person.firstName)
res14: scala.collection.immutable.Map[String,List[Person]] = Map(Jane -> List(Person@7ff5376c), John -> List(Person@1979d4fb, Person@6544c984, Person@4076a247))

scala> people.groupBy(person => person.lastName)
res15: scala.collection.immutable.Map[String,List[Person]] = Map(Brown -> List(Person@4076a247), Smith -> List(Person@1979d4fb), Doe -> List(Person@7ff5376c, Person@6544c984))

With this, we can have all the Johns greet us.

scala> people.groupBy(person => person.firstName)("John").foreach(john => println(john.greet(true)))
Hello, my name is John Smith. I'm a linguist.
Hello, my name is John Doe. I'm a philosopher.
Hello, my name is John Brown. I'm a mathematician.
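The grouping above can also feed further computations. The following is a rough sketch, using a cut-down Person class and a made-up function name averageAgeByFirstName, that computes the average age for each first name from the groups that groupBy returns.

```scala
object GroupSummary {

  // A cut-down stand-in for the Person class, with only the fields needed here.
  class Person(val firstName: String, val lastName: String, val age: Int)

  // For each first name, average the ages of the people in that group.
  def averageAgeByFirstName(people: List[Person]): Map[String, Double] =
    people.groupBy(person => person.firstName).map { case (name, group) =>
      name -> group.map(person => person.age).sum / group.length.toDouble
    }

  def main(args: Array[String]): Unit = {
    val people = List(
      new Person("John", "Smith", 37),
      new Person("Jane", "Doe", 34),
      new Person("John", "Doe", 43),
      new Person("John", "Brown", 28)
    )
    averageAgeByFirstName(people).foreach { case (name, avg) =>
      println(name + ": " + avg)
    }
  }

}
```

The pattern is the same as before: groupBy gives a Map from each key to a List[Person], and mapping over that Map turns each group into whatever summary you need.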

Standalone objects

Above, we saw how to create instances of the Person class by using the new keyword and assigning the resulting object to a variable. We can come back full circle to the first JohnSmith object we created, which was a standalone ScalaObject. We can instead create such a standalone object by extending the Person class.

scala> object ThomYorke extends Person("Thom", "Yorke", 43, "musician")
defined module ThomYorke

scala> ThomYorke.greet(true)
res25: String = Hello, my name is Thom Yorke. I'm a musician.

By extending the Person class to create the object, we are saying that the object is a kind of Person — see more on inheritance below. So, ThomYorke is a Person object, like the others we created, but it is for a different use case that we’ll see more of in the next tutorial. For now, I’ll summarize, very roughly, by saying that the ThomYorke object can be made more accessible by other code that might be using my code, while the johnSmith and janeDoe objects are going to be more locally contained.

Inheritance

The standalone objects lead us naturally to the idea of inheritance. In many domains, there are natural hierarchies of types, such that properties of a super type are inherited by its subtypes (e.g. fish have gills and swim, so salmon have gills and swim). For example, we could have a Linguist type that is a kind of Person, a ComputerScientist type that is a kind of Person, and so on. To model this, we create one class that extends another and possibly provides some additional parameters, such as the following definition of a Linguist sub-type of Person.

class Linguist (
  firstName: String,
  lastName: String,
  age: Int,
  val speciality: String,
  val favoriteLanguage: String
) extends Person(firstName, lastName, age, "linguist") {

  def workGreeting =
    "As a " + occupation + ", I am a " + speciality + " who likes to study the language " + favoriteLanguage + "."

}

The Linguist class has its own parameter list: some of these, like firstName, lastName, and age, are passed on to Person, and there are new parameter fields speciality and favoriteLanguage. The extends portion of the definition passes on the relevant parameters needed to construct all the information to make a Person, and for a Linguist, it directly sets the occupation parameter to be “linguist” — thus, we don’t need to provide that when we construct a Linguist, such as Noam Chomsky.

scala> val noamChomsky = new Linguist("Noam", "Chomsky", 83, "syntactician", "English")
noamChomsky: Linguist = Linguist@54c0627f

Having defined a Linguist object in this way, we can ask it to give its work greeting.

scala> noamChomsky.workGreeting
res26: java.lang.String = As a linguist, I am a syntactician who likes to study the language English.

We can also access fields and functions of Person objects, like age and greet.

scala> noamChomsky.age
res27: Int = 83

scala> noamChomsky.greet(true)
res28: String = Hello, my name is Noam Chomsky. I'm a linguist.

Of course, the Linguist-specific fields like favoriteLanguage are accessible too.

scala> noamChomsky.favoriteLanguage
res29: String = English

The observant reader will have noticed that some of the parameters are prefaced with val and others are not. We’ll get back to that distinction a bit later.

Traits

We could of course now go on to define a ComputerScientist class that would also have a workGreeting function, but Linguist.workGreeting and ComputerScientist.workGreeting would be entirely separate, with nothing telling Scala that both kinds of object can give a work greeting. To relate them, we can use traits, which are like classes, but which define an interface of functions and fields that classes can supply concrete values and implementations for. (Note: traits can also define concrete fields and functions, so they aren’t limited to placeholder functions as we show below.)

As an example, here’s a Worker trait, which simply defines a function workGreeting and declares that it must return a String.

trait Worker {
  def workGreeting: String
}

The Linguist class defined earlier already provides an implementation of that function. To allow a Linguist to be considered as a type of Worker, we add with Worker after extending Person.

class Linguist (
  firstName: String,
  lastName: String,
  age: Int,
  val speciality: String,
  val favoriteLanguage: String
) extends Person(firstName, lastName, age, "linguist") with Worker {

  def workGreeting =
    "As a " + occupation + ", I am a " + speciality + " who likes to study the language " + favoriteLanguage + "."

}

This is called “mixing in” the trait Worker, because the Linguist class mixes in the fields and functions of Worker with those of Person.
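Since traits can also carry concrete functions, here is a small sketch of what that looks like. The trait LoudWorker and its loudWorkGreeting function are hypothetical names used only for this illustration; the trait leaves workGreeting abstract and builds a concrete function on top of it, so every class that mixes it in gets loudWorkGreeting for free.

```scala
object ConcreteTraitDemo {

  // Hypothetical trait: one abstract function and one concrete function.
  trait LoudWorker {
    // Abstract: each class mixing in the trait must implement this.
    def workGreeting: String

    // Concrete: defined once here, in terms of the abstract function.
    def loudWorkGreeting: String = workGreeting.toUpperCase
  }

  class Student(school: String, subject: String) extends LoudWorker {
    def workGreeting = "I'm studying " + subject + " at " + school + "!"
  }

  def main(args: Array[String]): Unit = {
    val student = new Student("The University of Texas at Austin", "history")
    println(student.workGreeting)
    println(student.loudWorkGreeting)
  }

}
```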

Note that we can also create classes that simply extend a trait like Worker.

class Student (school: String, subject: String) extends Worker {
  def workGreeting = "I'm studying " + subject + " at " + school + "!"
}

We can now create a Student object and request their greeting.

scala> val anonymousStudent = new Student("The University of Texas at Austin", "history")
anonymousStudent: Student = Student@734445b5

scala> anonymousStudent.workGreeting
res32: java.lang.String = I'm studying history at The University of Texas at Austin!

Notice that the parameters school and subject were not preceded by val in the definition of Student. That means that they are not member fields of the Student class, which means that they cannot be accessed externally. For example, attempting to access the value provided for school for anonymousStudent fails.

scala> anonymousStudent.school
<console>:11: error: value school is not a member of Student
anonymousStudent.school

Of course, internally, Student can use the values provided to such parameters, for example in defining the result of workGreeting. This sort of encapsulation hides properties of the objects of a class from code that is outside the class; this strategy can help reduce the degrees of freedom available to users of your code so that they only use what you want them to. In general, if others don’t need to use it, you shouldn’t make it available to them.
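To see the two kinds of parameters side by side, consider this sketch of a hypothetical variant of Student, here called OpenSchoolStudent, in which school is declared with val and subject is not. Only school can be read from outside the class, while both can be used inside it.

```scala
object ValParamDemo {

  // Hypothetical variant of Student: school is a member field, subject is not.
  class OpenSchoolStudent(val school: String, subject: String) {
    // Inside the class, both parameters are available.
    def workGreeting: String = "I'm studying " + subject + " at " + school + "!"
  }

  def main(args: Array[String]): Unit = {
    val student = new OpenSchoolStudent("The University of Texas at Austin", "history")
    println(student.school)        // fine: school was declared with val
    // println(student.subject)    // would not compile: subject is not a member
    println(student.workGreeting)
  }

}
```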

Returning to classes that are both Persons and Workers, when we define a ComputerScientist, we do a similar extends … with declaration as we did for Linguist.

class ComputerScientist (
  firstName: String,
  lastName: String,
  age: Int,
  val speciality: String,
  favoriteProgrammingLanguage: String
) extends Person(firstName, lastName, age, "computer scientist") with Worker {

  def workGreeting =
    "As a " + occupation + ", I work on " + speciality + ". Much of my code is written in " + favoriteProgrammingLanguage + "."

}

Let’s create Andrew McCallum as a ComputerScientist object.

scala> val andrewMcCallum = new ComputerScientist("Andrew", "McCallum", 44, "machine learning", "Scala")
andrewMcCallum: ComputerScientist = ComputerScientist@493cd5ba

scala> andrewMcCallum.workGreeting
res31: java.lang.String = As a computer scientist, I work on machine learning. Much of my code is written in Scala.

Because we redefined Linguist to be a Worker, we need to recreate Noam Chomsky using the new definition. (The creation looks the same as before, but it uses the new class definition that has been updated in the REPL.)

scala> val noamChomsky = new Linguist("Noam", "Chomsky", 83, "syntactician", "English")
noamChomsky: Linguist = Linguist@6fccaf14

A minor thing to note: the speciality field of ComputerScientist is disconnected from that of Linguist, so there is no particular expectation of consistency of use across the two: for Linguist it is a description of a person working in a sub-area, but for ComputerScientist it is a description of a sub-area.

So, what happens if we put noamChomsky and andrewMcCallum in a List together?

scala> val professors = List(noamChomsky, andrewMcCallum)
professors: List[Person with Worker] = List(Linguist@6fccaf14, ComputerScientist@493cd5ba)

Scala has created a list with type List[Person with Worker]; this is the most specific type that is valid for all elements of the list. It means we can treat all of the elements as Persons, e.g. accessing their occupation (which is a member field of Person).

scala> professors.map(prof => prof.occupation)
res34: List[String] = List(linguist, computer scientist)

And we can treat each element of the list as a Person and a Worker, e.g. printing out their fullName (from Person) and their workGreeting (from Worker).

scala> professors.foreach(prof => println(prof.fullName + ": " + prof.workGreeting))
Noam Chomsky: As a linguist, I am a syntactician who likes to study the language English.
Andrew McCallum: As a computer scientist, I work on machine learning. Much of my code is written in Scala.

We cannot, however, access fields and functions that are specific to Linguists or ComputerScientists, such as favoriteLanguage from Linguist.

scala> professors.map(prof => prof.favoriteLanguage)
<console>:15: error: value favoriteLanguage is not a member of Person with Worker
professors.map(prof => prof.favoriteLanguage)

It is easy to see why Scala has this behavior: even though that would have been valid for noamChomsky, it would not be for andrewMcCallum (according to the way we defined Linguist and ComputerScientist).

Matching on types in polymorphic Lists

Consider what happens when the anonymousStudent is in a list with the professors.

scala> val workers = List(noamChomsky, andrewMcCallum, anonymousStudent)
workers: List[ScalaObject with Worker] = List(Linguist@6fccaf14, ComputerScientist@493cd5ba, Student@734445b5)

The Person type is gone, and we now have a list of a more general type ScalaObject with Worker. Now we can only use the workGreeting method from Worker.

However, it is worth pointing out that match statements come in handy when you have collections of heterogeneous objects. For example, put the following code into the REPL.

val people = List(johnSmith, noamChomsky, andrewMcCallum, anonymousStudent)

people.foreach { person =>
  person match {
    case x: Person with Worker => println(x.fullName + ": " + x.workGreeting)
    case x: Person => println(x.fullName + ": " + x.greet(true))
    case x: Worker => println("Anonymous:" + x.workGreeting)
  }
}

The result is the following (remember that johnSmith was never defined as a Linguist — he was defined as a Person whose occupation is “linguist”).

John Smith: Hello, my name is John Smith. I'm a linguist.
Noam Chomsky: As a linguist, I am a syntactician who likes to study the language English.
Andrew McCallum: As a computer scientist, I work on machine learning. Much of my code is written in Scala.
Anonymous:I'm studying history at The University of Texas at Austin!

So, we can switch our behavior by matching to more specific types using Scala’s pattern matching.
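One detail worth spelling out is that the order of the cases matters: Scala tries them from top to bottom and uses the first one that fits, so the most specific type should come first. The following sketch uses deliberately simplified stand-ins for Person, Linguist and Student (they are not the full definitions from above), and the function name describe is made up for the example.

```scala
object MatchOrderDemo {

  // Simplified stand-ins for the classes defined earlier in this tutorial.
  trait Worker { def workGreeting: String }

  class Person(val firstName: String) {
    def greet: String = "Hi, I'm " + firstName + "!"
  }

  class Linguist(firstName: String) extends Person(firstName) with Worker {
    def workGreeting = "I study language."
  }

  class Student extends Worker {
    def workGreeting = "I'm studying!"
  }

  // Cases are tried top to bottom, so the most specific type is listed first.
  def describe(x: Any): String = x match {
    case pw: Person with Worker => pw.firstName + ": " + pw.workGreeting
    case p: Person => p.firstName + ": " + p.greet
    case w: Worker => "Anonymous: " + w.workGreeting
    case _ => "Unknown"
  }

  def main(args: Array[String]): Unit = {
    val things = List(new Person("John"), new Linguist("Noam"), new Student, 42)
    things.foreach(thing => println(describe(thing)))
  }

}
```

If the Person case were listed first, a Linguist would match it and never reach the Person with Worker case. The final underscore case catches anything that is neither a Person nor a Worker, such as the number 42 here.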

The apply function

Scala provides a simple but incredibly nice feature: if you define an apply function in a class or object, you don’t actually need to write “apply” in order to use it. As an example, the following object adds one to an argument supplied to its apply method.

object AddOne {
  def apply (x: Int): Int = x+1
}

So, we can use it just like you’d normally expect.

scala> AddOne.apply(3)
res41: Int = 4

But, we can also do without the “.apply” portion and get the same result.

scala> AddOne(3)
res42: Int = 4

If a class has an apply method, then we can do the same trick with any object of that class.

class AddN (amountToAdd: Int) {
  def apply (x: Int): Int = x + amountToAdd
}

scala> val add2 = new AddN(2)
add2: AddN = AddN@43ca04a1

scala> add2(5)
res43: Int = 7

scala> val add42 = new AddN(42)
add42: AddN = AddN@83e591f

scala> add42(8)
res44: Int = 50

As it turns out, you’ve been using apply methods quite often, without knowing it! When you have a List and you access an element by index, you’ve used the apply method of the List class.

scala> val numbers = 10 to 20 toList
numbers: List[Int] = List(10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)

scala> numbers(3)
res46: Int = 13

scala> numbers.apply(3)
res47: Int = 13

The same goes for accessing values using keys in a Map, and similarly for many of the other classes you’ve been using in Scala so far.
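Here is a short sketch of the Map case, along with one more place where apply has been hiding: writing List(1, 2, 3) is itself a call to the apply method of the List object. The object name ApplyDemo is just for this illustration.

```scala
object ApplyDemo {

  def main(args: Array[String]): Unit = {
    val ages = Map("John" -> 37, "Jane" -> 34)

    // Looking up a key is a call to the Map's apply method.
    println(ages("Jane"))
    println(ages.apply("Jane"))

    // Creating a List without "new" is a call to apply on the List object.
    val viaShorthand = List(1, 2, 3)
    val viaApply = List.apply(1, 2, 3)
    println(viaShorthand == viaApply)
  }

}
```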

Wrap-up

This tutorial has covered the basics of object-oriented programming in Scala. Hopefully, it is enough to give a decent sense of what objects and classes are and how you can do things with them. There is much much more to be learned about them, but this should be sufficient to get you started so that further study can be done meaningfully. It is important to understand these concepts since Scala is object-oriented from the ground up. In fact, in many of the previous tutorials, I’ve at times gone through some extra hoops to try to describe what is going on without having to talk about object-orientation. But now you can see things like Int, Double, List, Map, and so on for what they are: classes that contain particular fields and functions that you can use to get things done. You can now start coding your own classes to enable your own custom behaviors in your applications.


In loving memory of Belle Scarlett Baldridge
September 29, 2011

I buried my baby daughter Belle today. It wasn’t supposed to be this way. Babies just aren’t supposed to die. We are fortunate to live in a time of favorable survival rates for babies and their mothers. We enjoy high degrees of order and predictability in our day-to-day lives (here in the USA, at least), and it is easy to forget that one still has innocence to lose. This has been the saddest, hardest week of my life. I had always heard that a parent should never have to bury their own child. I didn’t doubt it, but now I know it, fully. This morning, I gazed down at a gaping hole, my little girl’s grave, while I held her casket in my arms. It mirrored the hole already in my heart. It disarmed and terrified me, but also showed me that both were there to receive Belle and preserve her memory.

With this post, I seek to honor and remember Belle, to thank those who have supported us this week, to help myself grieve, and hopefully, to help—perhaps a little—others in the future who must unfortunately deal with the death of their child. My apologies if the post is on the (melo)dramatic side. It’s how I feel, and it seems to be part of my healing process, so please bear with me.

My wife Cheryl and I had long been anticipating Belle’s arrival, with a due date of today – October 4, 2011. Like most expecting parents, we had considered many of the possible outcomes of the pregnancy, including even the possibility of complications that would involve our baby and/or Cheryl needing hospitalization — but never the possibility that our baby Belle wouldn’t make it into this world, never the possibility of a stillbirth. The unyielding march of life and death has left us suddenly and unexpectedly bereft of a person we loved, cared for and were ready to teach and eventually send forth into the world.

We knew Belle from her kicks, and her responses to our voices, songs, and laughter. It’s an imperfect medium of communication, but it suffices to start the relationship that one builds with one’s child — they simply aren’t strangers when you see them for the first time. This is something that can perhaps be hard to understand for those who have not yet had children, and it is a common source of pain for parents of stillborn children: it is somehow perceived by many to not be as great a loss as for those whose children died after their birth date. A great line I read in one of the many materials I’ve been given about such loss is that on a scale of one to ten, the pain of losing a child is always a ten, no matter the age or circumstances. It’s true. I would submit that there is a further dynamic element for parents of a stillborn child: you have gone from a state of accelerating excitement and anticipation, to a huge resounding thud of shock and disbelief. The “what if’s” have in very short order become “never be’s.” This sudden reversal kicks in the first moment you are told that your baby’s heartbeat has stopped and then reverberates as you reel from the pain and try to regroup.

Little Belle is true to her name: she is beautiful, even in death. I can now only imagine what she would have looked like as she grew up, but thankfully I can do at least that. And, I can do that from a starting point of having been able to spend time with her on the day she was born, September 29, 2011. We had a wonderful team with us at Belle’s birth—including doctors, midwives, nurses, and doula—and they helped us through the intensely emotional and difficult process of bringing Belle into the world and, perhaps more importantly, to help us spend meaningful time with her before saying goodbye. They encouraged us to be with Belle, to hold her and take pictures, and not rush things. We now have at least those memories—even so bittersweet—to keep with us, something which many parents of stillborn babies are never given because no one tells them they can and should. This is a really important aspect of Belle’s birth that I hope to get across: you are hurting and spinning from the shock and pain, yet there are important decisions to be made from the very start; while you may have been provided with comprehensive and well-written literature on how to approach the situation, you have little emotional space for it and there is too much of it to possibly work through before you must make decisions.  If you or someone you care for finds themselves unfortunately in this situation, try to get across this message: take time with the baby and take pictures. You won’t get more chances later, and you’ll almost surely regret it if you don’t.

Another important thing for us was to have a small memorial service for Belle, and also a burial. As an agnostic without any religious affiliation, I had no default expectation for what to do. Cheryl and I had years ago decided that cremation would be the thing for us eventually. However, with Belle, Cheryl quickly realized that she wanted a place to visit her, so we went with a burial. I did not feel strongly about it, but it felt right to me when we did it today, so I’ll probably be glad for that choice in the long run. It was very hard to pick out her plot at the cemetery on Friday—it’s an area reserved for infants, a grid of small plots that serves as a concrete reminder of the fragility of the early days of life. Looking at the empty spot where Belle would be buried made it all seem more real, more this-is-really-happening, in the mix of surreal feelings of that day and the previous day. Of course, handing over a credit card to pay for the services and the plot then felt bizarre, an odd juxtaposition of a completely mundane action with the profound grief I was keeping in check. Regardless of that strangeness, it is one of those things which just must be done. Belle is now there, and it is a peaceful place, with trees and birds singing in them.

It turns out that stillbirths are more common than I would have ever thought. I had only directly known of one before Belle, and had assumed it must have been a case of extreme misfortune. Actually, in the USA, the average rate of stillbirths is roughly 1 in 150 births, about 26,000 babies every year. The rate is much higher in developing countries. Despite this prevalence, there apparently is not a great deal of research into it (and it seems to be an inherently difficult thing to research), so we still know little about specific actions that can be taken to prevent it. For the things we do know, such as tangled umbilical cords, there is very little warning — there is a window of perhaps 5-10 minutes from the time of fetal distress in which to save the baby. Knowing this actually relieved us of a great deal of guilt as we had initially second guessed ourselves, retracing our steps in the days leading up to Belle’s birth and imagining ways we could/should have known to try to get her out earlier.

Regardless of the statistics, regardless of whether we’ll know the cause of Belle’s death, it all just ends up feeling unfair. I’ve been robbed of my little girl, whose heart I had heard beating just days before. Belle should have had her fair shot at life, and I’m sure she would have made hers a great one. It shouldn’t have been this way, but that is what happened and now we must live with that and move on. In this, I’m so thankful for the amazing relationship I have with Cheryl. We’re both hurting, immensely, but we also are optimists who have both already overcome our fair share of challenges in our lives. Together, and with the help of family and friends, we’ll regroup and carry on, carrying Belle’s memory with us.

Little Belle, I’ll love you forever.

Addendum

There are many people who have provided us with amazing, and often unexpected, support over the last week.

Our doula, Shelley Scotka, was our shining light on the day of Belle’s birth. Many people have probably never heard of doulas — summarizing quickly, they are amazing women who assist in natural childbirth. They bring their knowledge of traditional birthing techniques and practical experience from many births to bear on yours, including translating what the doctors are saying and doing so that you hear what is going on, in simple, understandable terms. Shelley was there for our son’s delivery, a 50+ hour marathon that she did a great deal to ease. Little did we know that she would be every bit as vital for us for a stillbirth as she was for a live birth. She was a rock who helped before, during and after the delivery, and who continues to shower us with love and care.

We’re also incredibly thankful for the medical team that delivered Belle last Thursday at St. David’s North Austin. Our practice is OB-GYN North, and the midwives, doctor, and technician who had to tell us that Belle’s heartbeat had stopped were caring and kind, and helped us immensely with the initial shock and disbelief. Kathy Harrison-Short, CNM  had caught our son two years before and she immediately came to comfort us. Lisa Carlile, CNM stayed past her shift and was the one who ultimately caught Belle, at Cheryl’s request. Dr. Martha Smitz was the physician on duty that day. She demonstrated tremendous sensitivity, compassion and overwhelming competence throughout. She had an uncanny ability to put us at ease even in the midst of the sorrow and confusion we were going through. The nurses, other doctors, social worker and pastor were all similarly supportive and sensitive. The nurses deserve special thanks for taking such great care of Cheryl before the delivery and of Belle after it. Everyone treated us, and Belle, with tremendous dignity.

Since that day, our family, friends and colleagues have been incredibly supportive. One of the blessings in tragedy is the concrete realization that one is surrounded by a wonderful support network. My younger brother lives here in Austin and my mother had just arrived, ready to help us with Belle; they’ve been helping us through the whole thing, especially with our toddler son, even while dealing with their own loss and grief. My father flew in from Chicago, and my older brother immediately came over from Baton Rouge with his daughter. The sound of her playing with our toddler son over the weekend was a welcome, joyful addition that helped combat what would otherwise have been a tendency toward a somber mood. My brother’s wife helped us a great deal from afar, providing support both as a family member and as a practicing physician. My step-father will be here soon, a delayed visit (at my request) since I knew we’d need more backup once the main family contingent was gone.

Others have also given us great strength, including sharing their own pain and anger at the situation, and in a few cases, their own direct experience with stillbirths. There have been generous offers of help, including offers to teach some of my classes in the coming weeks. Though I’ve so far responded to almost none of them, I’ve read and appreciated every email of support from friends, colleagues, and students. In a way, this post is my response, so please consider this my thank you to you all. And to those who I have not yet gotten in touch with about Belle’s death, please understand that there has not been any particular plan or care with my communications regarding it; I’m just now getting geared up to pass the word on to more friends, and some of you are probably seeing this post as a result of that effort.

I must also give high praise to the people at Cook-Walden funeral homes. They have treated us very kindly and have been incredibly responsive to our needs. One of the things about the situation is that many decisions must be made in rapid succession, and you get some of them not-quite-right the first time around. Cook-Walden was very accommodating to changes in how we wanted to do the service and burial and to requests for articles of Belle’s that we only realized later that we’d want (such as a lock of her hair). They treated us and Belle with dignity and allowed us time and space to make decisions and say goodbye to her.

Finally, I must thank the volunteers from Now I Lay Me Down To Sleep, who Shelley called in for us. NILMDTS is a non-profit that has professional photographers who come to take pictures of stillborn babies and their families, and then later retouch them to provide nicer images of the baby than one could generally hope to capture by oneself. They were caring and professional, and we look forward to seeing the result of their work with Belle. If you are looking for a great non-profit to donate to, please consider them.

Topics: scala.io.Source, accessing files, flatMap, mutable Maps

Preface

This is part 8 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

This tutorial is about accessing the file system in order to work with text files. The previous tutorial showed how to build a Map that contains the counts of each word type in a given text. However, it was assumed that the text was available in a String variable, and typically we are interested in knowing things about files that live on the file system, or on the internet. This tutorial shows how to read a file’s contents into Scala for processing, either by building a single String for the file or by consuming it line-by-line in a streaming fashion. Along the way, mutable Maps are introduced as a way to enable word counting without reading an entire file into memory.

Word count on the contents of a file

As an example, we’ll use the complete Sherlock Holmes from Project Gutenberg. Download it, put it into a directory, and then start up the Scala REPL in that directory. To access files, we’ll use the Source class, so to start you need to import it.

scala> import scala.io.Source
import scala.io.Source

Source provides a number of ways to interact with files and make them accessible to you in your Scala program. The fromFile method is the one you’ll probably need most.

scala> Source.fromFile("pg1661.txt")
res3: scala.io.BufferedSource = non-empty iterator

This creates a BufferedSource, from which you can easily get all of the file’s contents as a String.

scala> val holmes = Source.fromFile("pg1661.txt").mkString
holmes: String =
"Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net
<...many more lines...>

With this, you can do the same things as shown in tutorial 7 to get the word counts (except that here we’ll split on whitespace sequences rather than just a single space).

scala> val counts = holmes.split("\\s+").groupBy(x=>x).mapValues(x=>x.length)
counts: scala.collection.immutable.Map[java.lang.String,Int] = Map(wood-work, -> 1, "Pray, -> 1, herself. -> 2, stern-post -> 1, "Should -> 1, incident -> 8, serious -> 14, earth--" -> 2, sinister -> 10, comply -> 7, breaks -> 1, forgotten -> 3, precious -> 10, 'It -> 3, compliment -> 2, suite, -> 1, "DEAR -> 1, summarise. -> 1, "Done -> 1, fine.' -> 1, lover -> 5, of. -> 2, lead. -> 1, plentiful -> 1, 'Lone -> 4, malignant -> 1, terrible -> 14, rate -> 1, mole -> 1, assert -> 1, lights -> 2, Stevenson, -> 1, submitted -> 4, tap. -> 1, beard, -> 1, band--a -> 1, force! -> 1, snow -> 7, Produced -> 2, ask, -> 1, purchasing -> 1, Hall, -> 1, wall. -> 5, remarked -> 32, laughing -> 4, member." -> 1, 30,000 -> 2, Redistributing -> 1, coat, -> 6, "'One -> 2, 'band,' -> 1, relapsed -> 1, apol...

scala> counts("Holmes")
res2: Int = 197

scala> counts("Watson")
res3: Int = 4

Lest you think it strange that Watson only shows up four times, keep in mind that we split on whitespace, and that means that in a sentence like the following, the token of interest is Watson,” rather than Watson.

“You could not possibly have come at a better time, my dear Watson,” he said cordially.

Looking that and others up shows more tokens containing Watson in the story.

scala> counts("Watson,\"")
res4: Int = 19

scala> counts("Watson,")
res5: Int = 40

scala> counts("Watson.")
res6: Int = 10

Of course, the real problem is that tokenizing on whitespace is too crude. To do this properly generally takes a good hand-built tokenizer (which is able to keep tokens like e.g. and Mr. and Yahoo! while splitting punctuation off most words) or a machine learned one that is trained on data hand-labeled for tokens. For an example of the latter, see the Apache OpenNLP toolkit tokenizers, which includes pre-trained models for English.
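To make the effect of the tokenization choice concrete, here is a minimal sketch that compares splitting on whitespace with splitting on anything that is not a letter, digit, or apostrophe, using the Watson sentence from above (written with straight quotes). The regular expression and the names whitespaceTokens and strippedTokens are my own illustrative choices. This is still a crude approach: it would wrongly break tokens like e.g. and Mr., so it is not a substitute for a real tokenizer. The counting step uses map over the key-value pairs, which plays the same role as the mapValues step shown earlier.

```scala
// The example sentence, with straight double quotes escaped for the String literal.
val sentence = "\"You could not possibly have come at a better time, my dear Watson,\" he said cordially."

// Splitting on whitespace leaves punctuation attached to the words.
val whitespaceTokens = sentence.split("\\s+").toList

// Splitting on runs of characters that are not letters, digits, or apostrophes
// strips most punctuation. A leading delimiter produces an empty first token,
// so empty Strings are filtered out.
val strippedTokens = sentence.split("[^A-Za-z0-9']+").toList.filter(token => token.nonEmpty)

// Count each token: group identical tokens, then take the size of each group.
val counts = strippedTokens.groupBy(x => x).map { case (word, occurrences) => (word, occurrences.length) }

println(whitespaceTokens.contains("Watson"))  // false: the token is Watson," instead
println(counts("Watson"))                     // 1
```

With the punctuation split off, all of the variants such as Watson," and Watson. would be counted together under the single key Watson.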

Working line by line

Quite often, you need to work through a file line by line, rather than reading the entire thing in as a single string as we did above. For example, you might need to process each line differently, so just having it as a single String isn’t particularly convenient. Or, you might be working with a large file that cannot easily fit into memory (which is what happens when you read in the entire string). You can obtain the lines in the file as an Iterator[String], in which each item is a single line from the file, using the getLines method.

scala> Source.fromFile("pg1661.txt").getLines
res4: Iterator[String] = non-empty iterator

This iterator is ready for you to consume lines, but it doesn’t read all of the file into memory right away — instead it buffers it such that each line will be available for you as you ask for it, essentially reading off disk as you demand more lines. You can think of this as streaming the file to your Scala program, much like modern audio and video content is streamed to your computer: it is never actually stored, but is just transferred in parts to where it is needed, when it is needed.

Of course, Iterators share much with sequence data structures like Lists: once we have an Iterator, we can use foreach, for, map, etc. on it. So to print out all of the lines in the file, we can do the following.

scala> Source.fromFile("pg1661.txt").getLines.foreach(println)
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net

Title: The Adventures of Sherlock Holmes

Author: Arthur Conan Doyle
<...many more lines...>

That creates a lot of output, but it shows you how you can easily create your own Scala implementation of the Unix cat program: just save the following line in a file called cat.scala:

scala.io.Source.fromFile(args(0)).getLines.foreach(println)

And then call that with the name of the file to list its contents.

$ scala cat.scala pg1661.txt

Back in the REPL, it is somewhat less-than-ideal to see the entire file. If you just want to see the start of the file, use the take method on the Iterator before the foreach.

scala> Source.fromFile("pg1661.txt").getLines.take(5).foreach(println)
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included

The take method is quite useful in general with any sequence, and provides the complement of the drop method, as shown in the following examples on a simple List[Int].

scala> val numbers = 1 to 10 toList
numbers: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> numbers.take(3)
res12: List[Int] = List(1, 2, 3)

scala> numbers.drop(3)
res13: List[Int] = List(4, 5, 6, 7, 8, 9, 10)

scala> numbers.take(3) ::: numbers.drop(3)
res14: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
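The same take and drop methods work on the Iterator[String] that getLines gives you, so they can be combined to pick out a window of lines, for instance to skip a header. Here is a small sketch; to keep it runnable without any file on disk, it assumes the text comes from Source.fromString rather than Source.fromFile, and the five-line sample text is made up for illustration.

```scala
import scala.io.Source

// A made-up five-line text standing in for a file.
val text = "line 1\nline 2\nline 3\nline 4\nline 5"

// Skip the first line, then keep only the next three.
// Nothing is read until toList consumes the Iterator.
val middle = Source.fromString(text).getLines().drop(1).take(3).toList

middle.foreach(println)
```

This prints line 2, line 3 and line 4, each on its own line.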

Word counting line by line, first try

Now that we’ve seen how to read a file and start working with it line-by-line, how do we count the number of occurrences of each word? Recall from tutorial 7 and above that the starting point was to have a sequence (Array, List, etc) of Strings in which each element is a word token. To start moving toward that, we can simply use the toList method on the Iterator[String] obtained from getLines.

scala> val holmes = Source.fromFile("pg1661.txt").getLines.toList
holmes: List[String] = List(The Project Gutenberg EBook of The Adventures of Sherlock Holmes, by Sir Arthur Conan Doyle, (#15 in our series by Sir Arthur Conan Doyle), "", Copyright laws are changing all over the world. Be sure to check the, copyright laws for your country before downloading or redistributing, this or any other Project Gutenberg eBook., "", This header should be the first thing seen when viewing this Project, Gutenberg file.  Please do not remove it.  Do not change or edit the, header without written permission., "", Please read the "legal small print," and other information about the, eBook and Project Gutenberg at the bottom of this file.  Included is, important information about your specific rights and restrictions in, how the file may be used.  You can also find ou...

We now have the contents of the file as a List[String], and may proceed to do useful things with it. For example, we could map each line (Strings) to be sequences of whitespace-separated Strings.

scala> val listOfListOfWords = Source.fromFile("pg1661.txt").getLines.toList.map(x => x.split(" ").toList)
listOfListOfWords: List[List[java.lang.String]] = List(List(Project, Gutenberg's, The, Adventures, of, Sherlock, Holmes,, by, Arthur, Conan, Doyle), List(""), List(This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with), List(almost, no, restrictions, whatsoever., "", You, may, copy, it,, give, it, away, or), List(re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included), List(with, this, eBook, or, online, at, www.gutenberg.net), List(""), List(""), List(Title:, The, Adventures, of, Sherlock, Holmes), List(""), List(Author:, Arthur, Conan, Doyle), List(""), List(Posting, Date:, April, 18,, 2011, [EBook, #1661]), List(First, Posted:, November, 29,, 2002), List(""), List(Language:, English), List(""), List(""), List(***, START, OF, THIS, PRO...

And, as we saw in tutorial 7, when we have a List of Lists, we can use flatten to create one big List.

scala> val listOfWords = listOfListOfWords.flatten
listOfWords: List[java.lang.String] = List(Project, Gutenberg's, The, Adventures, of, Sherlock, Holmes,, by, Arthur, Conan, Doyle, "", This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with, almost, no, restrictions, whatsoever., "", You, may, copy, it,, give, it, away, or, re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included, with, this, eBook, or, online, at, www.gutenberg.net, "", "", Title:, The, Adventures, of, Sherlock, Holmes, "", Author:, Arthur, Conan, Doyle, "", Posting, Date:, April, 18,, 2011, [EBook, #1661], First, Posted:, November, 29,, 2002, "", Language:, English, "", "", ***, START, OF, THIS, PROJECT, GUTENBERG, EBOOK, THE, ADVENTURES, OF, SHERLOCK, HOLMES, ***, "", "", "", "", Produced, by, an, anonymous, Project, Gut...

But, now you might recognize that this is the map-then-flatten pattern we saw previously, which means we can flatMap it instead.

scala> val flatMappedWords = Source.fromFile("pg1661.txt").getLines.toList.flatMap(x => x.split(" "))
flatMappedWords: List[java.lang.String] = List(Project, Gutenberg's, The, Adventures, of, Sherlock, Holmes,, by, Arthur, Conan, Doyle, "", This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with, almost, no, restrictions, whatsoever., "", You, may, copy, it,, give, it, away, or, re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included, with, this, eBook, or, online, at, www.gutenberg.net, "", "", Title:, The, Adventures, of, Sherlock, Holmes, "", Author:, Arthur, Conan, Doyle, "", Posting, Date:, April, 18,, 2011, [EBook, #1661], First, Posted:, November, 29,, 2002, "", Language:, English, "", "", ***, START, OF, THIS, PROJECT, GUTENBERG, EBOOK, THE, ADVENTURES, OF, SHERLOCK, HOLMES, ***, "", "", "", "", Produced, by, an, anonymous, Project,...

But you should be a bit bothered by all this: wasn’t the idea here (in part) not to read all of the lines in at once? Indeed, with what we did above, as soon as we said toList on the Iterator, the whole file was read into memory. However, we can do without the toList step and just directly flatMap the Iterator and get a new Iterator over the tokens rather than the lines.

scala> val flatMappedWords = Source.fromFile("pg1661.txt").getLines.flatMap(x => x.split(" "))
flatMappedWords: Iterator[java.lang.String] = non-empty iterator
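As a quick check that the two routes produce the same tokens, here is a sketch on a tiny two-line text. It assumes the text is supplied with Source.fromString so that it runs without the Sherlock Holmes file, and the variable names are just for illustration.

```scala
import scala.io.Source

// A made-up two-line text standing in for a file.
val text = "the dog barks\nthe cat"

// Route 1: read all lines into a List, map each line to its words, then flatten.
val viaMapFlatten =
  Source.fromString(text).getLines().toList.map(line => line.split(" ").toList).flatten

// Route 2: flatMap directly on the Iterator of lines, giving an Iterator of tokens.
val tokenIterator = Source.fromString(text).getLines().flatMap(line => line.split(" "))

// Converting to a List here is only so that the two results can be compared.
val viaFlatMap = tokenIterator.toList

println(viaMapFlatten == viaFlatMap)  // true
```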

Now, if we want to count the words, we can convert that to a List and do the groupBy and mapValues trick we’ve seen already (output omitted).

scala> val counts = Source.fromFile("pg1661.txt").getLines.flatMap(x => x.split(" ")).toList.groupBy(x=>x).mapValues(x=>x.length)

Oops: that worked, but we once again brought the whole file into memory, because the List that was created from toList holds every token in the file. We’ll see next how to use a mutable Map to get around this.

Word counting by streaming with an Iterator and using mutable Maps

In all of the tutorials so far, I’ve pretty much stuck to immutable data structures except when mutable ones show up due to context (like Arrays coming out of the split method). It’s good to try to make use of immutable data structures where possible, but there are times when mutable ones are more convenient and perhaps more appropriate.

With the immutable Maps we saw in the previous tutorial, you could not change the assignment to a key, nor could you add a new key.

lettersToNumbers: scala.collection.immutable.Map[java.lang.String,Int] = Map(A -> 1, B -> 2, C -> 3)

scala> lettersToNumbers("A") = 4
<console>:9: error: value update is not a member of scala.collection.immutable.Map[java.lang.String,Int]
lettersToNumbers("A") = 4

scala> lettersToNumbers("D") = 5
<console>:9: error: value update is not a member of scala.collection.immutable.Map[java.lang.String,Int]
lettersToNumbers("D") = 5

There is another kind of Map, scala.collection.mutable.Map, that does allow this sort of behavior.

scala> import scala.collection.mutable
import scala.collection.mutable

scala> val mutableLettersToNumbers = mutable.Map("A"->1, "B"->2, "C"->3)
mutableLettersToNumbers: scala.collection.mutable.Map[java.lang.String,Int] = Map(C -> 3, B -> 2, A -> 1)

scala> mutableLettersToNumbers("A") = 4

scala> mutableLettersToNumbers("D") = 5

scala> mutableLettersToNumbers
res4: scala.collection.mutable.Map[java.lang.String,Int] = Map(C -> 3, D -> 5, B -> 2, A -> 4)

It also has a handy way to increase the count associated with a key, using the += method.

scala> mutableLettersToNumbers("D") += 5

scala> mutableLettersToNumbers
res6: scala.collection.mutable.Map[java.lang.String,Int] = Map(C -> 3, D -> 10, B -> 2, A -> 4)

However, we can’t use that method with a key that doesn’t exist.

scala> mutableLettersToNumbers("E") += 1
java.util.NoSuchElementException: key not found: E
<...stacktrace...>

Fortunately, we can provide a default. Here’s an example of starting a new Map with a default of 0.

scala> val counts = mutable.Map[String,Int]().withDefault(x=>0)
counts: scala.collection.mutable.Map[String,Int] = Map()

scala> counts("Z") += 1

scala> counts("Y") += 1

scala> counts("Z") += 1

scala> counts
res11: scala.collection.mutable.Map[String,Int] = Map(Z -> 2, Y -> 1)

Note: when you start with some values already in a Map, Scala can infer the types of the keys and the values, but when initializing an empty Map, it is necessary to explicitly declare the key and value types.
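One detail worth seeing in action: the default only affects what a lookup returns, and looking up a missing key does not add that key to the Map. The sketch below shows this, along with getOrElse as an alternative way to supply a fallback for a single lookup. The key "Q" and the variable names are arbitrary choices for the illustration.

```scala
import scala.collection.mutable

// An empty mutable Map whose lookups fall back to 0 for missing keys.
val counts = mutable.Map[String, Int]().withDefault(x => 0)

counts("Z") += 1  // reads the default 0, then stores 1 under "Z"

val unseen = counts("Q")                     // 0, from the default
val hasQ = counts.contains("Q")              // false: the lookup did not add "Q"
val viaGetOrElse = counts.getOrElse("Q", 0)  // 0, fallback given at the call site

println(counts)  // only Z is stored
```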

With this in hand, here is how we can use flatMap plus a mutable Map to count words in a text without reading the entire text into memory.

import scala.collection.mutable
val counts = mutable.Map[String, Int]().withDefault(x=>0)
for (token <- scala.io.Source.fromFile("pg1661.txt").getLines.flatMap(x =>x.split("\\s+")))
  counts(token) += 1

Having created the counts Map in this way, we can convert it to an immutable Map with the toMap method once we are done adding elements.

scala> val fixedCounts = counts.toMap
fixedCounts: scala.collection.immutable.Map[String,Int] = Map(wood-work, -> 1,
<...output truncated...>

Now we can’t modify the values on fixedCounts, which has advantages in many contexts, e.g. we can’t accidentally destroy values or add unwanted keys, and there are (positive) implications for parallel processing.

scala> fixedCounts("Holmes") = 0
<console>:13: error: value update is not a member of scala.collection.immutable.Map[String,Int]
fixedCounts("Holmes") = 0
^
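The pieces of this section can be gathered into one small function. This is a sketch rather than a definitive implementation: countTokens is a name I made up, the sample text is supplied with Source.fromString so the example runs without the Sherlock Holmes file, and the filter on empty tokens is an addition of mine (an empty line yields one empty token when split on whitespace).

```scala
import scala.collection.mutable
import scala.io.Source

// Hypothetical helper: count whitespace-separated tokens from an Iterator of lines.
// Only the current line and the counts are held in memory, never the whole text.
def countTokens(lines: Iterator[String]): Map[String, Int] = {
  val counts = mutable.Map[String, Int]().withDefault(x => 0)
  for (token <- lines.flatMap(line => line.split("\\s+")) if token.nonEmpty)
    counts(token) += 1
  counts.toMap  // hand back an immutable Map once counting is done
}

// A made-up three-line sample, including one empty line.
val sample = "the dog saw the cat\n\nthe cat ran"
val sampleCounts = countTokens(Source.fromString(sample).getLines())

println(sampleCounts("the"))  // 3
println(sampleCounts("cat"))  // 2
```

To count the words of a real file, you would pass Source.fromFile("pg1661.txt").getLines() as the argument instead.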

Reading a file from a URL

As it turns out scala.io.Source can do a lot more than read from a file. Another example is to read from a URL to access a file on the internet, using the fromURL method.

val holmesUrl = """http://www.gutenberg.org/cache/epub/1661/pg1661.txt"""
for (line <- Source.fromURL(holmesUrl).getLines)
  println(line)

If you are just going to analyze the same file again and again, this is probably not what you need — just download the file and use it locally. However, it can be quite useful in contexts where you are exploring links within pages (e.g. while processing Wikipedia or Twitter data) and need to read in content from URLs on the fly.
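When reading URLs on the fly, some of them will be malformed or unreachable, so it can help to capture failures rather than let the program crash. Below is one possible sketch using scala.util.Try, which is my own choice of technique and not something this tutorial has covered. The helper name linesFromUrl is hypothetical, and so that the example runs without a network connection it reads from a temporary file exposed as a file: URL. Note that it reads all lines into a List, so it is meant for reasonably small pages.

```scala
import java.nio.file.Files
import scala.io.Source
import scala.util.Try

// Hypothetical helper: read all lines from a URL, capturing any exception as a Failure.
def linesFromUrl(url: String): Try[List[String]] = Try {
  val source = Source.fromURL(url)
  try source.getLines().toList finally source.close()
}

// Stand-in for a web page: a temporary local file, addressed through a file: URL.
val tmp = Files.createTempFile("holmes-sample", ".txt")
Files.write(tmp, "first line\nsecond line\n".getBytes("UTF-8"))

val ok = linesFromUrl(tmp.toUri.toURL.toString)  // Success with the two lines
val bad = linesFromUrl("not a url")              // Failure: the URL is malformed

println(ok.isSuccess)
println(bad.isFailure)
```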

Use (up) the Source

A final note on the Iterators you get with Source.fromFile and Source.fromURL: you can only iterate through them once! This is part of what makes them more efficient: they aren’t holding all that text in memory. So, don’t be surprised if you get the following behavior.


scala> val holmesIterator = Source.fromFile("pg1661.txt").getLines
 holmesIterator: Iterator[String] = non-empty iterator

scala> holmesIterator.foreach(println)

Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
 almost no restrictions whatsoever.  You may copy it, give it away or
 re-use it under the terms of the Project Gutenberg License included
 with this eBook or online at www.gutenberg.net

<...many lines of output...>

This Web site includes information about Project Gutenberg-tm,
 including how to make donations to the Project Gutenberg Literary
 Archive Foundation, how to help produce our new eBooks, and how to
 subscribe to our email newsletter to hear about new eBooks.

scala> holmesIterator.foreach(println)

<...nothing output!...>

So, the Iterator is used up! If you want to go through the file again, you’ll need to spin up a new Iterator just like you did the first time around. The neat thing about staying with the Iterators and not converting to Lists (and thus bringing everything into memory) is that each mapping operation we do on the Iterator applies only for the current item we are looking at, so we never need to read the whole file into memory.
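The same behavior can be reproduced on a tiny scale. This sketch assumes a three-line text supplied with Source.fromString so that it runs without any file; the variable names are just for illustration.

```scala
import scala.io.Source

// A made-up three-line text standing in for a file.
val text = "a\nb\nc"

val lineIterator = Source.fromString(text).getLines()
val firstPass = lineIterator.toList   // consumes every line
val secondPass = lineIterator.toList  // nothing left to consume

// Going through the text again requires a fresh Iterator.
val freshPass = Source.fromString(text).getLines().toList

println(firstPass)
println(secondPass)
println(freshPass)
```

The first and third Lists contain a, b and c, while the second one is empty.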

Of course, if you have a reasonably small file to work with, you should feel absolutely free to toList it and work with it that way if you prefer. It will often be more convenient, since you can do the groupBy and mapValues pattern.

Copyright 2011 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

Topics: Maps, Sets, groupBy, Options, flatten, flatMap

Preface

This is part 7 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

Lists (and other sequence data structures, like Ranges and Arrays) allow you to group collections of objects in an ordered manner: you can access elements of a list by indexing their position in the list, or iterate over the list elements, one by one, using for expressions and sequence functions like map, filter, reduce and fold. Another important kind of data structure is the associative array, which you’ll come to know in Scala as a Map. (Yes, this has the unfortunate ambiguity with the map function, but their use will be quite clear from context.) Maps allow you to store a collection of key-value pairs and to access the values by the keys associated with them, rather than via an index (as with a List).

Example cases where you could use a Map:

  • Associating English words with their German translations
  • Associating each word with its count in a given text
  • Associating each word with its possible parts-of-speech

You’ll see concrete examples of each of these in this post.

Creating Maps and accessing their elements

Maps are quite intuitive to grasp. Here’s an example with a few English words and their German translations. One easy way of creating a Map is by passing in a list of pairs, where the first element of each pair defines a key and the second defines a corresponding value.

scala> val engToDeu = Map(("dog","Hund"), ("cat","Katze"), ("rhinoceros","Nashorn"))
engToDeu: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(dog -> Hund, cat -> Katze, rhinoceros -> Nashorn)

Notice that the Map entries are of the form key -> value. We may then retrieve the German translation for dog by providing the key “dog” to the Map we created.

scala> engToDeu("dog")
res0: java.lang.String = Hund

Think for a moment what you would have to do to accomplish this with Lists. You’d need two Lists, one for each language, and they’d need to be aligned so that each element in one list corresponded to its translation in the other list.

scala> val engWords = List("dog","cat","rhinoceros")
engWords: List[java.lang.String] = List(dog, cat, rhinoceros)

scala> val deuWords = List("Hund","Katze","Nashorn")
deuWords: List[java.lang.String] = List(Hund, Katze, Nashorn)

Then, to find the translation of cat, you would have to find the index of cat in engWords, and then look up that index in deuWords.

scala> engWords.indexOf("cat")
res2: Int = 1

scala> deuWords(engWords.indexOf("cat"))
res3: java.lang.String = Katze

This is actually quite inefficient, as well as having other problems. Maps are the right thing for what we want here, and they do their job of retrieving values for keys quite efficiently.

It turns out that we can take two lists that are aligned in this way and construct a Map very easily. Recall that zipping two lists together creates one list of pairs, where each pair gives the elements that shared the same index.

scala> engWords.zip(deuWords)
res4: List[(java.lang.String, java.lang.String)] = List((dog,Hund), (cat,Katze), (rhinoceros,Nashorn))

By calling the toMap method on such a List of pairs, we get a Map.

scala> engWords.zip(deuWords).toMap
res5: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(dog -> Hund, cat -> Katze, rhinoceros -> Nashorn)

Note that even though the REPL is showing the order of the key-value pairs to be the same as the original list we constructed the map from, there is no inherent order to the elements of a Map.
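
Two behaviors are worth knowing before relying on zip followed by toMap: zip stops at the end of the shorter list, and toMap keeps only the last value when a key appears more than once. Here is a small sketch of both; the lists are made up for illustration and are deliberately misaligned.

```scala
// Hypothetical misaligned lists: the German list is one word short.
val engWords = List("dog", "cat", "rhinoceros")
val deuWords = List("Hund", "Katze")

// zip pairs elements only while both lists have one, so "rhinoceros" is dropped.
val shortMap = engWords.zip(deuWords).toMap

// When a key occurs twice, the later pair overwrites the earlier one.
val lastWins = List(("dog", "Hund"), ("dog", "Dackel")).toMap
```

Neither case raises an error, so it is worth checking the sizes of your lists if the resulting Map looks smaller than expected.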

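You can add elements to a Map to create a new Map using the + operator, writing each key-value pair with an arrow -> between the key and the value. Note that the pair needs to be in parentheses: + and -> have the same precedence, so the first expression below is read as (engToDeu + "owl") -> "Eule", which glues "owl" onto the printed form of the Map and produces a pair of Strings rather than a new Map.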


scala> engToDeu + "owl" -> "Eule"
res6: (java.lang.String, java.lang.String) = (Map(dog -> Hund, cat -> Katze, rhinoceros -> Nashorn)owl,Eule)

scala> engToDeu + ("owl" -> "Eule", "hippopotamus" -> "Nilpferd")
res7: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(rhinoceros -> Nashorn, dog -> Hund, owl -> Eule, hippopotamus -> Nilpferd, cat -> Katze)
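
Because + and -> bind equally tightly, the safest habit is to parenthesize every pair you add. The following is a minimal sketch of that habit; the names withOwl and withBoth are mine, chosen for illustration, and the sketch sticks to the single-pair form of +.

```scala
val engToDeu = Map("dog" -> "Hund", "cat" -> "Katze", "rhinoceros" -> "Nashorn")

// Parenthesize the pair so that + receives a (key, value) tuple.
val withOwl = engToDeu + ("owl" -> "Eule")

// Additions can be chained; each + returns a new Map.
val withBoth = engToDeu + ("owl" -> "Eule") + ("hippopotamus" -> "Nilpferd")

// The original Map is immutable, so it still has its three entries.
val originalSize = engToDeu.size
```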

You can add one Map to another using the ++ operator.


scala> val newEntries = Map(("hippopotamus", "Nilpferd"),("owl","Eule"))
newEntries: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(hippopotamus -> Nilpferd, owl -> Eule)

scala> val expandedEngToDeu = engToDeu ++ newEntries
expandedEngToDeu: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(rhinoceros -> Nashorn, dog -> Hund, owl -> Eule, hippopotamus -> Nilpferd, cat -> Katze)

You can do the same by passing in a List of tuples to the ++ operator.


scala> engToDeu ++ List(("hippopotamus", "Nilpferd"),("owl","Eule"))
res8: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(rhinoceros -> Nashorn, dog -> Hund, owl -> Eule, hippopotamus -> Nilpferd, cat -> Katze)

And you can remove a key from a Map with the - operator.


scala> engToDeu - "dog"
res9: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(cat -> Katze, rhinoceros -> Nashorn)

See the Map API for more examples of such functions. Note: throughout this post, I’m sticking to immutable Maps — if you are looking at any other tutorials and are wondering why certain methods from those aren’t working here, they may have been using mutable Maps, which we’ll discuss later.

If we ask for the value associated with a key that doesn’t exist in the Map, we get an error.

scala> engToDeu("bird")
java.util.NoSuchElementException: key not found: bird
at scala.collection.MapLike$class.default(MapLike.scala:224)
(etc.)

You can check for whether a key is in the Map using the contains method.

scala> engToDeu.contains("bird")
res10: Boolean = false

scala> engToDeu.contains("dog")
res11: Boolean = true

Let’s say you had a list of English words and wanted to look up their corresponding German words, and you want to protect yourself against the NoSuchElementException. One way to do this is to filter the words using contains, and then map the remaining ones through engToDeu.

scala> val wordsToTranslate = List("dog","bird","cat","armadillo")
wordsToTranslate: List[java.lang.String] = List(dog, bird, cat, armadillo)

scala> wordsToTranslate.filter(x=>engToDeu.contains(x)).map(x=>engToDeu(x))
res12: List[java.lang.String] = List(Hund, Katze)

This is a useful way of safely applying a Map to a list of items. However, we’ll see a better way to deal with missing values later on, using Options.

If there is a sensible default value for any key you might try with your map, you can use the getOrElse method. You provide the key as the first argument, and then the default value as the second.


scala> engToDeu.getOrElse("dog","???")
res1: java.lang.String = Hund

scala> engToDeu.getOrElse("armadillo","???")
res2: java.lang.String = ???

It is quite common to use getOrElse with a default of 0 for Maps that contain statistics, such as word counts (see below), where the absence of a key naturally indicates that it has, e.g., a count of zero.

If you have a consistent default value for any keys that aren’t in the Map, you can set it by using the withDefault method.


scala> val engToDeu = Map(("dog","Hund"), ("cat","Katze"), ("rhinoceros","Nashorn")).withDefault(x => "???")
engToDeu: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(dog -> Hund, cat -> Katze, rhinoceros -> Nashorn)

scala> engToDeu("armadillo")
res3: java.lang.String = ???

Now you can ask for values in the usual manner, without needing to use getOrElse and providing the default every time.
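
For the word-count case mentioned above, the two approaches might look like the following sketch. The Map contents are made up for illustration, and withDefaultValue is a sibling of withDefault that takes a constant instead of a function.

```scala
// A small Map of word counts; the words and numbers are invented for this sketch.
val counts = Map("wood" -> 4, "chuck" -> 3)

// getOrElse supplies the default at each call site.
val seen = counts.getOrElse("wood", 0)
val unseen = counts.getOrElse("forest", 0)

// withDefaultValue attaches a constant default to the Map itself.
val countsWithDefault = counts.withDefaultValue(0)
val unseenViaApply = countsWithDefault("forest")

// The default does not add a key: contains still reports false.
val stillMissing = countsWithDefault.contains("forest")
```

The last line is the thing to remember: a default changes what lookup returns, not what the Map contains.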

Keys and values in Maps

You may have observed that Scala tells you more than just that you have created a Map. Like List, Map is a parameterized type, which means that it is a generic way of collecting a bunch of objects of particular types together. Above we saw an instance of a Map[String, String] (leaving off the java.lang part to make it clearer). The first String indicates that the keys are Strings and the second that the values are Strings. Basically, any type can be used in either position (warning: you should avoid using mutable data structures as keys unless you know what you are doing). Here are some examples (try to ignore the scala.collection.immutable and java.lang parts and just focus on the Map[X,Y] signatures we get).

scala> Map((10,"ten"), (100,"one hundred"))
res0: scala.collection.immutable.Map[Int,java.lang.String] = Map(10 -> ten, 100 -> one hundred)

scala> Map(("a",1),("b",2))
res1: scala.collection.immutable.Map[java.lang.String,Int] = Map(a -> 1, b -> 2)

scala> Map((1,3.14), (2,6.28))
res2: scala.collection.immutable.Map[Int,Double] = Map(1 -> 3.14, 2 -> 6.28)

scala> Map((("pi",1),3.14), (("tau",2),6.28))
res3: scala.collection.immutable.Map[(java.lang.String, Int),Double] = Map((pi,1) -> 3.14, (tau,2) -> 6.28)

scala> Map(("the",List("Determiner")),("book",List("Verb","Noun")),("off",List("Preposition","Verb")))
res4: scala.collection.immutable.Map[java.lang.String,List[java.lang.String]] = Map(the -> List(Determiner), book -> List(Verb, Noun), off -> List(Preposition, Verb))

The last two examples show some very useful aspects of key and value types that allow you to use more complex keys and values. The former uses a (String, Int) pair as a key, with signature Map[(String, Int), Double], and the latter uses a List[String] as the value, with signature Map[String, List[String]]. So you can bundle together several types using tuples and you can use parameterized data structures to parameterize another data structure.
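
Looking values up in such Maps works the same way as before; you just supply a key of the right shape. A short sketch, reusing the data from the last two examples (the val names are mine):

```scala
// Tuple keys: the key passed to the Map must itself be a (String, Int) pair.
val constants = Map(("pi", 1) -> 3.14, ("tau", 2) -> 6.28)
val tau = constants(("tau", 2))

// List values: the lookup returns a whole List, which you can then work with.
val wordTags = Map("the" -> List("Determiner"), "book" -> List("Verb", "Noun"))
val bookTags = wordTags("book")
val numBookTags = wordTags("book").length
```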

A simple translation task

Here is a mini German/English dictionary as a Map.

scala> val miniDictionary = Map(("befreit","liberated"),("baeche","brooks"),("eise","ice"),("sind","are"),("strom","river"),("und","and"),("vom","from"))
miniDictionary: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(und -> and, eise -> ice, sind -> are, befreit -> liberated, strom -> river, vom -> from, baeche -> brooks)

We can provide a (very bad) translation of the German sentence “vom eise befreit sind strom und baeche” using this dictionary: we simply split the German sentence and then map over its elements, looking up each word in the dictionary.

scala> val example = "vom eise befreit sind strom und baeche"
example: java.lang.String = vom eise befreit sind strom und baeche

scala> example.split(" ").map(deuWord => miniDictionary(deuWord)).mkString(" ")
res0: String = from ice liberated are river and brooks

Okay, not quite “from the ice they are freed, the stream and brook” but then again it’s pretty much the dumbest machine translation approach available…

A danger of course is that we will have words that aren’t in the dictionary, leading to an exception.

scala> val example2 = "vom eise befreit sind strom und schiffe"
example2: java.lang.String = vom eise befreit sind strom und schiffe

scala> example2.split(" ").map(deuWord => miniDictionary(deuWord)).mkString(" ")
java.util.NoSuchElementException: key not found: schiffe

We’ll return to this below.

Creating Maps from Lists using groupBy

We frequently have data stored in a particular data structure and would like to work with it using another data structure that organizes the data points in some other manner. Here, we’ll look at how to convert a List into a Map using the groupBy method in order to do some useful processing for working with parts-of-speech. We’ll also see the Set data structure along the way.

We’ll start with a very basic example of what groupBy does. Given a list of number tokens, we can obtain a Map from the number types to all of the tokens of each number.

scala> val numbers = List(1,4,5,1,6,5,2,8,1,9,2,1)
numbers: List[Int] = List(1, 4, 5, 1, 6, 5, 2, 8, 1, 9, 2, 1)

scala> numbers.groupBy(x=>x)
res19: scala.collection.immutable.Map[Int,List[Int]] = Map(5 -> List(5, 5), 1 -> List(1, 1, 1, 1), 6 -> List(6), 9 -> List(9), 2 -> List(2, 2), 8 -> List(8), 4 -> List(4))

As you can see from the result, groupBy took the anonymous function x=>x, grouped all of the elements of the List that have the same value of x, and then created a Map from each x to the group containing its tokens. So, we get 2 mapping to a List containing 2s, and so on. This probably seems a bit weird, but it is incredibly useful when we consider Lists that have more interesting elements in them. To do so, let’s go back to the part-of-speech tagging example from Part 4 of these tutorials. Say we have a sentence that is tagged with parts of speech, such as the following (made up) example that ensures some tag ambiguities.

in the dark , a tall man saw the saw that he needed to man to cut the dark tree .

The parts-of-speech could be annotated as follows (with lots of simplifications, and apologies for any offense caused to anyone’s linguistic sensitivities).

in/Prep the/Det dark/Noun ,/Punc a/Det tall/Adjective man/Noun saw/Verb the/Det saw/Noun that/Pronoun he/Pronoun needed/Verb to/Prep man/Verb to/Prep cut/Verb the/Det dark/Adjective tree/Noun ./Punc

See Part 4 for detailed explanation of how the following expression turns a string like this into a List of tuples.

scala> val tagged = "in/Prep the/Det dark/Noun ,/Punc a/Det tall/Adjective man/Noun saw/Verb the/Det saw/Noun that/Pronoun he/Pronoun needed/Verb to/Prep man/Verb to/Prep cut/Verb the/Det dark/Adjective tree/Noun ./Punc".split(" ").toList.map(x => x.split("/")).map(x => (x(0), x(1)))
tagged: List[(java.lang.String, java.lang.String)] = List((in,Prep), (the,Det), (dark,Noun), (,,Punc), (a,Det), (tall,Adjective), (man,Noun), (saw,Verb), (the,Det), (saw,Noun), (that,Pronoun), (he,Pronoun), (needed,Verb), (to,Prep), (man,Verb), (to,Prep), (cut,Verb), (the,Det), (dark,Adjective), (tree,Noun), (.,Punc))

Now, let’s use groupBy in various ways on this. The first thing we might be interested in is seeing which parts of speech each word is associated with.

scala> val groupedTagged = tagged.groupBy(x => x._1)
groupedTagged: scala.collection.immutable.Map[java.lang.String,List[(java.lang.String, java.lang.String)]] = Map(in -> List((in,Prep)), needed -> List((needed,Verb)), . -> List((.,Punc)), cut -> List((cut,Verb)), saw -> List((saw,Verb), (saw,Noun)), a -> List((a,Det)), man -> List((man,Noun), (man,Verb)), that -> List((that,Pronoun)), dark -> List((dark,Noun), (dark,Adjective)), to -> List((to,Prep), (to,Prep)), , -> List((,,Punc)), tall -> List((tall,Adjective)), he -> List((he,Pronoun)), tree -> List((tree,Noun)), the -> List((the,Det), (the,Det), (the,Det)))

So, now you see that the keys in the Map constructed by groupBy are the words and the values are the groups of the original elements. You can then see that the anonymous function x => x._1 provided to groupBy does two things: it specifies the part of the input elements that will group different items together and it specifies that that part of the input defines the key space.
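
To see that the function passed to groupBy really does decide the keys, here is a small sketch that groups words by a computed property rather than by a part of the element. The word list and the name byLength are my own, for illustration only.

```scala
val words = List("dark", "man", "saw", "tree", "tall", "the")

// The function's result (here, the word's length) becomes the key;
// every word that produces the same result lands in the same group.
val byLength = words.groupBy(w => w.length)

val threeLetterWords = byLength(3)
val fourLetterWords = byLength(4)
```

Within each group, the words keep the order they had in the original List.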

However, we don’t quite have what we want, which is to have the set of parts of speech associated with each word. Instead we have a List of tuples, e.g.:

scala> groupedTagged("saw")
res21: List[(java.lang.String, java.lang.String)] = List((saw,Verb), (saw,Noun))

Focussing on just this for a moment, we can map this and produce a List with just the parts-of-speech, and then turn that List into a Set with the toSet method in order to get just the unique parts-of-speech.

scala> groupedTagged("saw").map(x=>x._2)
res24: List[java.lang.String] = List(Verb, Noun)

scala> groupedTagged("saw").map(x=>x._2).toSet
res25: scala.collection.immutable.Set[java.lang.String] = Set(Verb, Noun)

Converting the List to a Set didn’t do much here, but consider the word “the”, which has multiple tokens with the same part-of-speech.

scala> groupedTagged("the")
res26: List[(java.lang.String, java.lang.String)] = List((the,Det), (the,Det), (the,Det))

scala> groupedTagged("the").map(x=>x._2)
res27: List[java.lang.String] = List(Det, Det, Det)

scala> groupedTagged("the").map(x=>x._2).toSet
res28: scala.collection.immutable.Set[java.lang.String] = Set(Det)

Sets are yet another of the useful data structures you have to work with, along with Maps and Lists. They work just like you would expect Sets to: they contain a collection of unique, unordered elements, and they allow you to see whether an element is in the set, whether one set is a subset of another, iterate over their elements, etc.
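
A few of those Set operations, as a minimal sketch. The particular Sets are made up for illustration, and openClassTags in particular is just a hypothetical grouping of tags.

```scala
val sawTags = Set("Verb", "Noun")
val openClassTags = Set("Noun", "Verb", "Adjective")

// Membership: contains, or simply applying the Set to an element.
val hasNoun = sawTags.contains("Noun")
val hasDet = sawTags("Det")

// Subset test.
val allOpenClass = sawTags.subsetOf(openClassTags)

// Building a Set from a List removes duplicates.
val deduped = List("Det", "Det", "Det").toSet

// Union: elements present in both Sets appear only once in the result.
val combined = sawTags ++ Set("Noun", "Adjective")
```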

Now, back to getting from the word/tag pairs to a mapping from words to possible tags for each word. The keys we got from tagged.groupBy(x => x._1) are what we want, but we want to transform the values from Lists of word/tag tokens to Sets of tags, which we can do with the mapValues method on Maps.

scala> val wordsToTags = tagged.groupBy(x => x._1).mapValues(listOfWordTagPairs => listOfWordTagPairs.map(wordTagPair => wordTagPair._2).toSet)
wordsToTags: scala.collection.immutable.Map[java.lang.String,scala.collection.immutable.Set[java.lang.String]] = Map(in -> Set(Prep), needed -> Set(Verb), . -> Set(Punc), cut -> Set(Verb), saw -> Set(Verb, Noun), a -> Set(Det), man -> Set(Noun, Verb), that -> Set(Pronoun), dark -> Set(Noun, Adjective), to -> Set(Prep), , -> Set(Punc), tall -> Set(Adjective), he -> Set(Pronoun), tree -> Set(Noun), the -> Set(Det))

The bit inside the mapValues(…) part will have some readers scrunching up their eyes, but you just need to look at the line where we got res28 above: if you understood that, then you just need to realize we are doing exactly the same thing, but now in the context of mapping over the values rather than dealing with a single value. Now you know how to map over values that you are mapping over.
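
If the nested anonymous functions are hard to read, one option is to give the inner step a name. The sketch below does that with a helper I have called tagsOf (a hypothetical name) on a shortened tagged list. It also uses a plain map over the key-value pairs in place of mapValues; as far as I know, in newer Scala versions (2.13 and later) mapValues returns a lazy view rather than a plain Map, whereas this form gives an ordinary Map in any version.

```scala
// A shortened version of the tagged list, for illustration.
val tagged = List(("saw", "Verb"), ("the", "Det"), ("saw", "Noun"), ("the", "Det"))

// Turn one group of (word, tag) pairs into the Set of its tags.
def tagsOf(pairs: List[(String, String)]): Set[String] =
  pairs.map(pair => pair._2).toSet

// Group by word, then replace each group with its Set of tags.
val wordsToTags = tagged.groupBy(pair => pair._1).map {
  case (word, pairs) => (word, tagsOf(pairs))
}
```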

Now that it is in hand, we can easily query the wordsToTags Map to see whether various words have various tags.

scala> wordsToTags("man")("Noun")
res8: Boolean = true

scala> wordsToTags("man")("Det")
res9: Boolean = false

scala> wordsToTags("man")("Verb")
res10: Boolean = true

scala> wordsToTags("saw")("Verb")
res11: Boolean = true

This is an example of how data structures within data structures (here Sets within a Map) are quite useful. (Exercise: think about what a tree is for a moment and how you might implement it using Lists.)

There are a variety of things you can do in computational linguistics with Maps from words to their parts-of-speech. A simple example is to compute the average number of tags per word type.

scala> val avgTagsPerType = wordsToTags.values.map(x=>x.size).sum/wordsToTags.size.toDouble
avgTagsPerType: Double = 1.2

If it isn’t clear to you what is going on here, tease it apart in your own REPL!
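
One way to tease it apart is to name each intermediate value. This sketch uses a smaller, made-up wordsToTags Map so the numbers are easy to check by hand; the intermediate names are mine.

```scala
val wordsToTags = Map(
  "saw" -> Set("Verb", "Noun"),
  "the" -> Set("Det"),
  "man" -> Set("Noun", "Verb"),
  "tall" -> Set("Adjective")
)

// Step 1: how many tags does each word type have?
val tagCounts = wordsToTags.values.map(tags => tags.size)

// Step 2: add those up (2 + 1 + 2 + 1).
val totalTags = tagCounts.sum

// Step 3: how many word types are there?
val numTypes = wordsToTags.size

// Step 4: divide, converting to Double first so the result is not truncated.
val avgTagsPerType = totalTags / numTypes.toDouble
```

Without the toDouble, both operands would be Ints and the division would throw away the fractional part.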

We can turn our word/tag pairs the other way to find out which words go with each part-of-speech. The only thing we need to do is groupBy on the second element of each pair, and then map the List values to their first element and get a Set from those.

scala> val tagsToWords = tagged.groupBy(x => x._2).mapValues(listOfWordTagPairs => listOfWordTagPairs.map(wordTagPair => wordTagPair._1).toSet)
tagsToWords: scala.collection.immutable.Map[java.lang.String,scala.collection.immutable.Set[java.lang.String]] = Map(Prep -> Set(in, to), Det -> Set(the, a), Noun -> Set(dark, man, saw, tree), Pronoun -> Set(that, he), Verb -> Set(saw, needed, man, cut), Punc -> Set(,, .), Adjective -> Set(tall, dark))

This basic paradigm is a powerful one for flipping between different data structures depending on what our needs are. It also demonstrates several important concepts with working with Lists, Maps and Sets. The next section shows a simple application of this idea for counting words in a text.

Counting words

A common task in computational linguistics is to calculate word statistics, and the most basic of those is to count the number of tokens of each word type in a particular text. The most common way to store and access those counts is in a Map, but how do you create such a Map from a given text? If we look at a text as a list of strings, then the groupBy paradigm we did above gives us exactly what we need — in fact it is even simpler than the word/tag manipulations done above.

The example text we’ll use is the tongue-twister about woodchucks.

scala> val woodchuck = "how much wood could a woodchuck chuck if a woodchuck could chuck wood ? as much wood as a woodchuck would , if a woodchuck could chuck wood ."
woodchuck: java.lang.String = how much wood could a woodchuck chuck if a woodchuck could chuck wood ? as much wood as a woodchuck would , if a woodchuck could chuck wood .

Given this, here’s how we can compute the number of occurrences of each word type. First we groupBy on the elements. Though a list of strings isn’t as interesting as having a list of Tuples as we had with words and tags, it still produces a useful result: we now have a unique set of keys corresponding to the types of elements found in the Array, and there is a corresponding value to each one that is the Array of tokens of that type.

scala> woodchuck.split(" ").groupBy(x=>x)
res29: scala.collection.immutable.Map[java.lang.String,Array[java.lang.String]] = Map(woodchuck -> Array(woodchuck, woodchuck, woodchuck, woodchuck), chuck -> Array(chuck, chuck, chuck), . -> Array(.), would -> Array(would), if -> Array(if, if), a -> Array(a, a, a, a), as -> Array(as, as), , -> Array(,), how -> Array(how), much -> Array(much, much), wood -> Array(wood, wood, wood, wood), ? -> Array(?), could -> Array(could, could, could))

And, we want to do something much simpler than what we did with the part-of-speech example: we just need to count the length of each Array, since they each contain every token of the corresponding word type. The function passed to mapValues is thus quite a bit simpler than the ones given in the previous section.

scala> val counts = woodchuck.split(" ").groupBy(x=>x).mapValues(x=>x.length)
counts: scala.collection.immutable.Map[java.lang.String,Int] = Map(woodchuck -> 4, chuck -> 3, . -> 1, would -> 1, if -> 2, a -> 4, as -> 2, , -> 1, how -> 1, much -> 2, wood -> 4, ? -> 1, could -> 3)

With counts, we can now access the frequencies of any of the words that were in the text.

scala> counts("woodchuck")
res5: Int = 4

scala> counts("could")
res6: Int = 3

Easy!  Of course, we normally want to build word counts for texts that are longer and are stored in a file rather than explicitly added to Scala code. The next tutorial will demonstrate how to do that.
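
If you find yourself counting words in more than one text, the same recipe can be wrapped in a function. This is just a sketch: the name wordCounts is hypothetical, it assumes the text is already tokenized with single spaces as in the example above, and it maps over the key-value pairs directly instead of using mapValues.

```scala
// Count the tokens of each word type in a space-separated text.
def wordCounts(text: String): Map[String, Int] =
  text.split(" ").toList
    .groupBy(word => word)
    .map { case (word, tokens) => (word, tokens.length) }

val counts = wordCounts("a woodchuck could chuck wood if a woodchuck could")
val unseenCount = counts.getOrElse("forest", 0)
```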

Iterating over the keys and values in a Map

The material above shows some useful aspects of Maps, but of course there is much more you can do with them, often requiring iterating through the key-value pairs in the Map. We’ll use the counts Map created above for demonstrating this.

You can access just the keys, or just the values.

scala> counts.keys
res0: Iterable[java.lang.String] = Set(woodchuck, chuck, ., would, if, a, as, ,, how, much, wood, ?, could)

scala> counts.values
res1: Iterable[Int] = MapLike(4, 3, 1, 1, 2, 4, 2, 1, 1, 2, 4, 1, 3)

Notice that these are both Iterable data structures, so we can do all of the usual mapping, filtering, and so on, that we have already done with lists. (You may convert them to Lists if you like using toList, of course.)

You can print out all of the key -> value pairs in the Map in a number of ways. One is to use a for expression.

scala> for ((k,v) <- counts) println(k + " -> " + v)
woodchuck -> 4
chuck -> 3
. -> 1
would -> 1
if -> 2
a -> 4
as -> 2
, -> 1
how -> 1
much -> 2
wood -> 4
? -> 1
could -> 3

And here are other ways to achieve the same result (output omitted since it is the same).

for (k <- counts.keys) println(k + " -> " + counts(k))
counts.map(kvPair => kvPair._1 + " -> " + kvPair._2).foreach(println)
counts.keys.map(k => k + " -> " + counts(k)).foreach(println)
counts.foreach { case(k,v) => println(k + " -> " + v) }
counts.foreach(kvPair => println(kvPair._1 + " -> " + kvPair._2))

And so on. Basically, you are able to step through the Map one key-value pair at a time, or you can grab the set of keys and then step through those and access the values from the map. Which form you use depends on what you need. For example, foreach and a for expression without yield (like the ones above) are used only for their side effects and don’t return a useful value, whereas map expressions (and for expressions with yield) do return values. Why would you want a value back? Well, as an example, consider grouping all words that have occurred the same number of times.

scala> val countsToWords = counts.keys.toList.map(k => (counts(k),k)).groupBy(x=>x._1).mapValues(x=>x.map(y=>y._2))
countsToWords: scala.collection.immutable.Map[Int,List[java.lang.String]] = Map(3 -> List(chuck, could), 4 -> List(woodchuck, a, wood), 1 -> List(., would, ,, how, ?), 2 -> List(if, as, much))

We go from a Map to a Set of its keys to a List of those keys to a List of Tuples of the values and the keys to a Map from the values of the original Map to such Tuples, and then we map the values of the new map to just contain the words (the original keys). (That’s a mouthful, so try each step in the REPL to see what is going on in detail.)
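
Here are those steps spelled out one at a time, as a sketch on a smaller, made-up counts Map. The intermediate names are mine, and the last step maps over the key-value pairs directly rather than using mapValues.

```scala
val counts = Map("wood" -> 4, "a" -> 4, "chuck" -> 3, "how" -> 1)

// Step 1: the keys of the Map, as a List.
val words = counts.keys.toList

// Step 2: pair each word with its count, count first.
val countWordPairs = words.map(word => (counts(word), word))

// Step 3: group the pairs by their count.
val groupedByCount = countWordPairs.groupBy(pair => pair._1)

// Step 4: keep only the words in each group.
val countsToWords = groupedByCount.map {
  case (count, pairs) => (count, pairs.map(pair => pair._2))
}
```

Since a Map has no inherent order, the words within each group may come out in any order; sort them if the order matters.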

Now we can output countsToWords sorted in descending numerical order by count, and then by alphabetical order by word within each count.

scala> countsToWords.keys.toList.sorted.reverse.foreach(x => println(x + ": " + countsToWords(x).sorted.mkString(",")))
4: a,wood,woodchuck
3: chuck,could
2: as,if,much
1: ,,.,?,how,would

Options and flatMapping for dealing with missing keys

I pointed out toward the start of this tutorial that we run into trouble if we ask for a key that doesn’t exist in a Map. Let’s go back to the engToDeu Map we began with.

scala> val engToDeu = Map(("dog","Hund"), ("cat","Katze"), ("rhinoceros","Nashorn"))
engToDeu: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(dog -> Hund, cat -> Katze, rhinoceros -> Nashorn)

scala> engToDeu("dog")
res0: java.lang.String = Hund

scala> engToDeu("bird")
java.util.NoSuchElementException: key not found: bird

There is another way of accessing the elements of a Map, using the get method.

scala> engToDeu.get("dog")
res2: Option[java.lang.String] = Some(Hund)

scala> engToDeu.get("bird")
res3: Option[java.lang.String] = None

Now, the return value is an Option[String]. An Option is either a Some that contains a value or a None, which means there is no value. If you want to get the value out of a Some, you use the get method on Options.

scala> val dogTrans = engToDeu.get("dog")
dogTrans: Option[java.lang.String] = Some(Hund)

scala> dogTrans.get
res4: java.lang.String = Hund

If you just use get on a Map to obtain an Option and then immediately call get on the Option, you get the same behavior we had before.

scala> engToDeu.get("dog").get
res6: java.lang.String = Hund

scala> engToDeu.get("bird").get
java.util.NoSuchElementException: None.get

So, at this point, you are probably thinking that this sounds like a waste of time that is just making things more complex. Wait! It actually is tremendously useful because of pattern matching and the way many methods on sequences work.

First, here is how you can write a protected form of translating the words in a list without getting an exception.

scala> wordsToTranslate.foreach { x => engToDeu.get(x) match {
|   case Some(y) => println(x + " -> " + y)
|   case None =>
| }}
dog -> Hund
cat -> Katze
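
The same kind of match can also produce a value instead of printing. In the following sketch, translateOrUnknown is a hypothetical helper of mine, and the "???" placeholder is the same default used with getOrElse earlier.

```scala
val engToDeu = Map("dog" -> "Hund", "cat" -> "Katze", "rhinoceros" -> "Nashorn")

// Translate a word, or give back "???" when the Map has no entry for it.
def translateOrUnknown(word: String): String = engToDeu.get(word) match {
  case Some(translation) => translation
  case None => "???"
}

val results = List("dog", "bird", "cat", "armadillo").map(w => translateOrUnknown(w))

// Options have their own getOrElse, which does the same job as the match above.
val viaGetOrElse = engToDeu.get("bird").getOrElse("???")
```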

I know… this probably still isn’t convincing — it still looks more involved than the conditional we used (far) above to check whether engToDeu contained a given key (at least for this particular example). Hold on… because now we are just about ready for things to get simpler, and learn some useful things about Lists in doing so.

First, you should know about a great method on Lists called flatten. If you have a List of Lists of Strings, you can use flatten to get a single List of Strings. Consider the following example, in which we flatten a List of Lists of Strings and make a single String out of the result with mkString. Notice that the empty List in the third spot of the main List just disappears when we flatten it.

scala> val sentences = List(List("Here","is","sentence","one","."),List("The","third","sentence","is","empty","!"),List(),List("Lastly",",","we","have","a","final","sentence","."))
sentences: List[List[java.lang.String]] = List(List(Here, is, sentence, one, .), List(The, third, sentence, is, empty, !), List(), List(Lastly, ,, we, have, a, final, sentence, .))

scala> sentences.flatten
res0: List[java.lang.String] = List(Here, is, sentence, one, ., The, third, sentence, is, empty, !, Lastly, ,, we, have, a, final, sentence, .)

scala> sentences.flatten.mkString(" ")
res1: String = Here is sentence one . The third sentence is empty ! Lastly , we have a final sentence .

Flattening in general is pretty useful in its own right. Where it comes into play with Option values is that Options can be thought of as Lists: Somes are like one-element Lists and Nones are like empty Lists. So, when you have a List of Options, the flatten method gives you the value in each Some, and any Nones just drop away.

scala> wordsToTranslate.map(x => engToDeu.get(x))
res12: List[Option[java.lang.String]] = List(Some(Hund), None, Some(Katze), None)

scala> wordsToTranslate.map(x => engToDeu.get(x)).flatten
res13: List[java.lang.String] = List(Hund, Katze)

This is such a generally useful paradigm that there is a function flatMap which does exactly this.

scala> wordsToTranslate.flatMap(x => engToDeu.get(x))
res14: List[java.lang.String] = List(Hund, Katze)

So, returning to the translation example above, we can now safely skip on by “schiffe” without fuss.

scala> example2.split(" ").flatMap(deuWord => miniDictionary.get(deuWord)).mkString(" ")
res15: String = from ice liberated are river and

Whether this is the desired behavior in this particular case is another question (e.g. you really should be doing some special unknown word handling). Nonetheless, you’ll find that flatMap is quite handy in general for this sort of pattern, in which a list of elements is used to retrieve values from a Map that will be missing some of those values.
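
As one very simple form of unknown word handling, you could keep the unknown word in the output and mark it, rather than dropping it. The sketch below contrasts the two behaviors on a cut-down dictionary; the angle-bracket marking is just a convention I made up for this example.

```scala
val miniDictionary = Map("vom" -> "from", "eise" -> "ice", "und" -> "and")
val sentence = "vom eise und schiffe"

// flatMap with get: unknown words silently disappear.
val skipped = sentence.split(" ").toList
  .flatMap(deuWord => miniDictionary.get(deuWord))
  .mkString(" ")

// map with getOrElse: unknown words are kept, wrapped in angle brackets.
val marked = sentence.split(" ").toList
  .map(deuWord => miniDictionary.getOrElse(deuWord, "<" + deuWord + ">"))
  .mkString(" ")
```

Which of the two you want depends on whether whoever reads the output needs to know that something was left untranslated.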

An example of the further use of Options and flatMap is that you also may create functions that return Options and are thus amenable to flatMapping. Consider a function that squares only odd numbers and throws evens away (note: the % operator is the modulo operator that finds the remainder of division of one number by another — try it in the REPL).


scala> def squareOddNumber (x: Int) = if (x % 2 != 0) Some(x*x) else None
squareOddNumber: (x: Int)Option[Int]

If you map over the numbers 1 to 10, you’ll see the Somes and Nones, and if you flatMap it, you get exactly the desired result of the squares of all the odd numbers without any pollution from the evens.

scala> (1 to 10).toList.map(x=>squareOddNumber(x))
res16: List[Option[Int]] = List(Some(1), None, Some(9), None, Some(25), None, Some(49), None, Some(81), None)

scala> (1 to 10).toList.flatMap(x=>squareOddNumber(x))
res17: List[Int] = List(1, 9, 25, 49, 81)
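
The same pattern is handy whenever a conversion can fail for some inputs. As a sketch, here is a hypothetical parseCount function that turns a String into an Int when it can; it relies on scala.util.Try, whose toOption method gives a Some on success and a None when an exception was thrown.

```scala
import scala.util.Try

// Some(n) if the String is a valid integer, None otherwise.
def parseCount(s: String): Option[Int] = Try(s.toInt).toOption

// Made-up raw input with a few entries that are not numbers.
val raw = List("4", "three", "2", "", "10")

val withOptions = raw.map(s => parseCount(s))
val parsed = raw.flatMap(s => parseCount(s))
```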

This turns out to be amazingly useful and common, so much so that the expression “just flatMap that shit” has become a common refrain among Scala programmers. Scala programmers even write scripts to remind them to do it. :)

Copyright 2011 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

Topics: regular expressions, matching and substitutions with the scala.util.matching API

Preface

This is part 6 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

This post is the second of two about regular expressions (regexes), which are essential for a wide range of programming tasks, and for computational linguistics tasks in particular. If you haven’t read it already, you might want to start with the first post about regexes. For what it’s worth, this post might actually be of some use to programmers who already are reasonably familiar with Scala but who haven’t used regular expressions much yet: it might save you some poking around to figure out how to do things you already know how to do quite well in other languages.

The use of regular expressions for capturing values for variable assignment and cases in match expressions is a very clean, well-thought-out and highly useful trait of support for regular expressions in the Scala language. However, their use for more complex string matching and substitution is, frankly, much less straightforward than it is in languages with built-in support for regular expressions, such as Perl (which, speaking as one who has coded a lot in Perl, you do *not* want to use for general programming). Scala is fully capable in that you can use regular expressions fully, but you’ll need to do so via the Regex API. In other words, you need to use a number of commands, not all of which are as straightforward as they could be. (This is not a rant, though I do obviously wish regular expressions were supported more naturally in Scala.)

Though I’ll refer to what I’m doing below as using the Regex API, I’ll note first that this makes it sound like a bigger deal than it is. It just means you are directly using classes and objects from the scala.util.matching package rather than using some of the special syntax and integration with Scala pattern matching we saw in the previous post.

More extensive matching

First off, let’s do what we did with pattern matching in the previous post, but now using the Regex class and the methods available to it to achieve the same ends. We can then start working with finding multiple matches and performing substitutions.

To recap, recall the name regular expression and how we can use it to initialize a group of variables based on matching a given string.

scala> val Name = """(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
Name: scala.util.matching.Regex = (Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)

scala> val smith = "Mr. John Smith"
smith: java.lang.String = Mr. John Smith

scala> val Name(title, first, last) = smith
title: String = Mr
first: String = John
last: String = Smith

Instead of doing it this way, let’s instead use the API methods. We start by using the regex to find the matches, if any. The method findAllIn of Regex does this for us.

scala> val matchesFound = Name.findAllIn(smith)
matchesFound: scala.util.matching.Regex.MatchIterator = non-empty iterator

The result is an iterator, which is an object that is like a list in that you can iterate over its elements with for expressions and foreach, use map to transform its values, and more.

scala> matchesFound.foreach(println)
Mr. John Smith

However, unlike Lists, you can only do this a single time. As the following shows, after you iterate through it once, its elements are used up.

scala> val matchesFound = Name.findAllIn(smith)
matchesFound: scala.util.matching.Regex.MatchIterator = non-empty iterator

scala> matchesFound.foreach(println)
Mr. John Smith

scala> matchesFound.foreach(println)

Another difference is that you cannot index into its elements directly.

scala> val matchesFound = Name.findAllIn(smith)
matchesFound: scala.util.matching.Regex.MatchIterator = non-empty iterator

scala> matchesFound(0)
<console>:11: error: scala.util.matching.Regex.MatchIterator does not take parameters
matchesFound(0)
^

If you wish to do that, you need to just call toList on the MatchIterator.

scala> val matchList = Name.findAllIn(smith).toList
matchList: List[String] = List(Mr. John Smith)

scala> matchList.foreach(println)
Mr. John Smith

scala> matchList.foreach(println)
Mr. John Smith

I’ll primarily work with the match results as a List for the remainder of this tutorial. However, note that when you are programming, you should consider whether you really need to do this: usually, the iterator will be sufficient, and it has the advantage of being more efficient.

Note above that what we have is a List[String]. That means we can see which portions of a string matched, which could include multiple matches.

scala> val sentence = "Mr. John Smith said hello to Ms. Jane Hill and then to Mr. Bill Brown."
sentence: java.lang.String = Mr. John Smith said hello to Ms. Jane Hill and then to Mr. Bill Brown.

scala> val matchList = Name.findAllIn(sentence).toList
matchList: List[String] = List(Mr. John Smith, Ms. Jane Hill, Mr. Bill Brown)

This will be useful in many contexts, but it won’t allow us to access the match groups that were defined in the Regex. For that, we need to use the matchData method, which converts the MatchIterator (which offers Strings as its elements) into an Iterator[Match] (which offers Match objects as its elements).

scala> val matchList = Name.findAllIn(smith).matchData
matchList: java.lang.Object with Iterator[scala.util.matching.Regex.Match] = non-empty iterator

Let’s convert that to a List and then grab the first element.

scala> val matchList = Name.findAllIn(smith).matchData.toList
matchList: List[scala.util.matching.Regex.Match] = List(Mr. John Smith)

scala> val firstMatch = matchList(0)
firstMatch: scala.util.matching.Regex.Match = Mr. John Smith

This Match object contains captured groups that we can access with the group method. The first index, 0, returns the entire match, and the rest access the captured groups.

scala> firstMatch.group(0)
res8: String = Mr. John Smith

scala> val title = firstMatch.group(1)
title: String = Mr

scala> val first = firstMatch.group(2)
first: String = John

scala> val last = firstMatch.group(3)
last: String = Smith

We can get a bit closer to the original pattern matched variable assignment by packaging them up as a tuple.

scala> val (title, first, last) = (firstMatch.group(1), firstMatch.group(2), firstMatch.group(3))
title: String = Mr
first: String = John
last: String = Smith

Update: There is a more concise way to do this using the range 1 to 3 and map firstMatch.group over that range. This creates a Seq(uence), which we can pattern match on. (Thanks to @missingfaktor.)


val Seq(title, first, last) = 1 to 3 map firstMatch.group

This should demonstrate why Scala’s support for Regexes in pattern matching is very nice for this. What you gain with the API is the ability to match multiple instances of a pattern in a string and then to perform computations with the Match results on the fly. For example, let’s return to the sentence with multiple names in it and use the Name regex to say hello to every name found in it.

scala> Name.findAllIn(sentence).matchData.foreach(m => println("Hello, " + m.group(0)))
Hello, Mr. John Smith
Hello, Ms. Jane Hill
Hello, Mr. Bill Brown

Of course, you can choose to print only subparts of the names, such as the title and the last name.

scala> Name.findAllIn(sentence).matchData.foreach(m => println("Hello, " + m.group(1) + ". " + m.group(3)))
Hello, Mr. Smith
Hello, Ms. Hill
Hello, Mr. Brown

Or you can filter the results, e.g. to only the Mr’s, and then print only the first names.

scala> Name.findAllIn(sentence).matchData.filter(m=>m.group(1) == "Mr").foreach(m => println("Hello, " + m.group(2)))
Hello, John
Hello, Bill

Notice that in the above lines, I didn’t convert the MatchIterator to a List since I was happy to just go through the matches once and do some actions.
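The examples above print their results. As a minimal sketch, the same filtering and group access can instead return values for later use. The variable names mrFirstNames and titleAndLast are made up for illustration, and findAllMatchIn is a Regex method that, in the Scala versions I am aware of, yields Match objects directly, so it should behave like findAllIn followed by matchData.

```scala
import scala.util.matching.Regex

val Name: Regex = """(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
val sentence = "Mr. John Smith said hello to Ms. Jane Hill and then to Mr. Bill Brown."

// Keep only the matches whose title is "Mr" and collect their first names.
val mrFirstNames: List[String] =
  Name.findAllIn(sentence).matchData
    .filter(m => m.group(1) == "Mr")
    .map(m => m.group(2))
    .toList

// findAllMatchIn gives an Iterator of Match objects without the matchData step.
val titleAndLast: List[(String, String)] =
  Name.findAllMatchIn(sentence)
    .map(m => (m.group(1), m.group(3)))
    .toList
```

Calling toList at the end is what turns the one-pass iterator into something that can be reused.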

Performing substitutions

The other thing you gain is the ability to use regular expressions for substituting one class of expressions with another. For example, let’s say that (for some odd reason) you would like to reverse everyone’s name so that “Mr. John Smith” becomes “Mr. Smith John”. This is accomplished by using the Regex method replaceAllIn, which takes two arguments: the first is the original string and the second is a function that takes a Match object and returns a String.

scala> val swapped = Name.replaceAllIn(sentence, m => m.group(1) + ". " + m.group(3) + " " + m.group(2))
swapped: String = Mr. Smith John said hello to Ms. Hill Jane and then to Mr. Brown Bill.

The variable m above is referring to each of the Match objects identified, in turn. That means we can access the groups as we did before. The thing that might feel strange at first is that the anonymous function m => m.group(1) + ". " + m.group(3) + " " + m.group(2) is an argument. It’s not very different from the following, where we first create a named function and then pass it as an argument.

scala> def swapFirstLast = (m: scala.util.matching.Regex.Match) => m.group(1) + ". " + m.group(3) + " " + m.group(2)
swapFirstLast: (util.matching.Regex.Match) => java.lang.String

scala> val swapped = Name.replaceAllIn(sentence, swapFirstLast)
swapped: String = Mr. Smith John said hello to Ms. Hill Jane and then to Mr. Brown Bill.

Note that now that we’ve defined it, we can use that same function to map the Matches returned by findAllIn to their swapped versions.

scala> val swappedNames = Name.findAllIn(sentence).matchData.map(swapFirstLast).toList
swappedNames: List[java.lang.String] = List(Mr. Smith John, Ms. Hill Jane, Mr. Brown Bill)

The difference is that using findAllIn gives us the Match results themselves, whereas replaceAllIn replaces them in the String in situ. Whether you need to do one or the other depends on your programming needs.

Determining whether an entire string matches using the Regex API

If you just want to know whether an entire given string matches a Regex, Scala unfortunately has a somewhat roundabout way for you to do this. First, here is the syntax, testing whether Name matches on the variables smith and sentence.

scala> Name.pattern.matcher(smith).matches
res21: Boolean = true

scala> Name.pattern.matcher(sentence).matches
res22: Boolean = false

So, sentence doesn’t match (despite having three names in it) because the entire string is not a single match to Name.

What is going on here is that we are actually using classes defined in Java for working with regular expressions. First, we get the java.util.regex.Pattern object associated with our scala.util.matching.Regex object.

scala> Name.pattern
res16: java.util.regex.Pattern = (Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)

Then we use that Pattern to get a java.util.regex.Matcher for the string.

scala> Name.pattern.matcher(smith)
res17: java.util.regex.Matcher = java.util.regex.Matcher[pattern=(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+) region=0,14 lastmatch=]

The Matcher class has a matches method that tells us whether there was a match or not for that string.

scala> Name.pattern.matcher(smith).matches
res18: Boolean = true

So, long-winded, but you can do it.

Note: there is another way to do this using Scala’s standard pattern matching paradigm discussed in the previous post on regexes.

scala> smith match { case Name(_,_,_) => true; case _ => false }
res23: Boolean = true

scala> sentence match { case Name(_,_,_) => true; case _ => false }
res24: Boolean = false

However, this requires the extra work of specifying the capture groups, which are being thrown away anyway.
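If you need this check often, one option is to wrap it in a small helper. The following is only a sketch: fullyMatches and fullyMatchesViaUnapply are hypothetical names, not part of the Regex API. The first wraps the Java Matcher route shown above. The second calls unapplySeq, the method that case Name(...) patterns rely on behind the scenes, which avoids having to spell out the capture groups.

```scala
import scala.util.matching.Regex

val Name: Regex = """(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r

// Hypothetical helper: true only if the whole string matches the regex.
def fullyMatches(regex: Regex, s: String): Boolean =
  regex.pattern.matcher(s).matches()

// Same check via unapplySeq, which returns Some(groups) on a full match
// and None otherwise.
def fullyMatchesViaUnapply(regex: Regex, s: String): Boolean =
  regex.unapplySeq(s).isDefined
```

With either helper, fullyMatches(Name, smith) takes the place of the longer Name.pattern.matcher(smith).matches.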

Simple substitutions with a second regular expression

There is another replaceAllIn method that takes a String defining a (fairly) standard regular expression substitution as its second argument rather than a function from Matches to Strings. This argument plays the same role as the replacement part of a standard s/// substitution in the Perl programming language, e.g. the following, which turns strings like “xyzaaaabbb123” into “xyzbbbaaaa123”.


s/(a+)(b+)/\2\1/

Unlike the \1 and \2 notation in that example (which is the same as the syntax discussed in Jurafsky and Martin’s book), Scala uses $1, $2, etc. in the replacement string. As an example, consider the first-last name swap we did before. Here it is repeated:

scala> val swapped = Name.replaceAllIn(sentence, m => m.group(1) + ". " + m.group(3) + " " + m.group(2))
swapped: String = Mr. Smith John said hello to Ms. Hill Jane and then to Mr. Brown Bill.

You can get the exact same effect somewhat more easily by constructing the replacement string with $n variables that refer to the groups.

scala> val swapped2 = Name.replaceAllIn(sentence, "$1. $3 $2")
swapped2: String = Mr. Smith John said hello to Ms. Hill Jane and then to Mr. Brown Bill.
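The same $n notation handles the s/(a+)(b+)/\2\1/ example from the start of this section. Here is a short sketch; the names AsThenBs and swappedLetters are made up for illustration.

```scala
// One or more a's followed by one or more b's, each run in its own group.
val AsThenBs = """(a+)(b+)""".r

// "$2$1" puts the b's (group 2) before the a's (group 1).
val swappedLetters = AsThenBs.replaceAllIn("xyzaaaabbb123", "$2$1")
```

The parts of the string outside the match (“xyz” and “123”) are left untouched.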

This is far more concise and readable than the m.group() style above, so it is preferable for cases like this. However, sometimes you’ll want to do some more interesting processing of the values in each group, such as changing the titles to another language and outputting only the first initial of the first name: e.g. “Mr. John Smith” would become “Sr. J. Smith” and “Mrs. Jane Hill” would become “Sra. J. Hill”. It isn’t clear to me how one could do this with the $n substitutions (if some reader is aware, please let me know). To do it with the Match => String function, it is straightforward. First, let’s define a method that maps the titles from English to Spanish.

def engTitle2Esp (title: String) = title match {
  case "Mr" => "Sr"
  case "Mrs" => "Sra"
  case "Ms" => "Srta"
}

Then we pass m.group(1) through that function by using engTitle2Esp(m.group(1)), and get just the first character of group 2 by indexing into it as m.group(2)(0).

scala> val spanishized = Name.replaceAllIn(sentence, m => engTitle2Esp(m.group(1)) + ". " + m.group(2)(0) + ". " + m.group(3))
spanishized: String = Sr. J. Smith said hello to Srta. J. Hill and then to Sr. B. Brown.

This gives you considerable control over how to process the replacements.
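One thing to watch for is that engTitle2Esp throws a MatchError for any title it does not list. As a sketch of an alternative, the titles can be kept in a Map and looked up with getOrElse, so that unknown titles pass through unchanged. Everything named here (NameWithDr, engToEsp, spanishize) is hypothetical, and the regex is extended with “Dr” only to show the fallback in action.

```scala
import scala.util.matching.Regex

// Hypothetical variant of Name that also accepts "Dr".
val NameWithDr: Regex = """(Mr|Mrs|Ms|Dr)\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r

// Lookup table from English titles to Spanish ones.
val engToEsp = Map("Mr" -> "Sr", "Mrs" -> "Sra", "Ms" -> "Srta")

// Translate the title if it is known, keep it as-is otherwise,
// and reduce the first name to its initial.
def spanishize(text: String): String =
  NameWithDr.replaceAllIn(text, m =>
    engToEsp.getOrElse(m.group(1), m.group(1)) + ". " +
      m.group(2).take(1) + ". " + m.group(3))

val example = spanishize("Dr. Ann Lee met Mr. John Smith.")
```

Whether a Map or a match expression reads better is mostly a matter of taste; the Map makes it easy to add titles without touching the function.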

Copyright 2011 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

Topics: regular expressions, matching

Preface

This is part 5 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

This post is the first of two about regular expressions (regexes), which are essential for a wide range of programming tasks, and for computational linguistics tasks in particular. This tutorial explains how to use them with Scala, assuming that the reader is already familiar with regular expression syntax. It shows how to create regular expressions in Scala and use them with Scala’s powerful pattern matching capabilities, in particular for variable assignment and cases in match expressions.

Creating regular expressions

Scala provides a very simple way to create regexes: just define a regex as a string and then call the r method on it. The following defines a regular expression that characterizes the string language a^m b^n (one or more a’s followed by one or more b’s, where the number of b’s is not necessarily the same as the number of a’s).

scala> val AmBn = "a+b+".r
AmBn: scala.util.matching.Regex = a+b+

To use meta-characters, like \s, \w, and \d, you must either escape the backslashes or use triple-quoted strings, which are referred to as raw strings. The following are two equivalent ways to write a regex that covers strings of a sequence of word characters followed by a sequence of digits.

scala> val WordDigit1 = "\\w+\\d+".r
WordDigit1: scala.util.matching.Regex = \w+\d+

scala> val WordDigit2 = """\w+\d+""".r
WordDigit2: scala.util.matching.Regex = \w+\d+

Whether escaping or using raw strings is preferable depends on the context. For example, with the above, I’d go with the raw string. However, for using a regex to split a string on whitespace characters, escaping is somewhat preferable.

scala> val adder = "We're as similar as two dissimilar things in a pod.\n\t-Blackadder"
adder: java.lang.String =
We're as similar as two dissimilar things in a pod.
-Blackadder

scala> adder.split("\\s+")
res2: Array[java.lang.String] = Array(We're, as, similar, as, two, dissimilar, things, in, a, pod., -Blackadder)

scala> adder.split("""\s+""")
res3: Array[java.lang.String] = Array(We're, as, similar, as, two, dissimilar, things, in, a, pod., -Blackadder)

A note on naming: the convention in Scala is to use variable names with the first letter uppercased for Regex objects. This makes them consistent with the use of pattern matching in match statements, as shown below.

Matching with regexes

We saw above that using the r method on a String returns a value that is a Regex object (more on the scala.util.matching part below). How do you actually do useful things with these Regex objects? There are a number of ways. The prettiest, and perhaps most common for the non-computational linguist, is to use them in tandem with Scala’s standard pattern matching capabilities. Let’s consider the task of parsing names and turning them into useful data structures that we can do various useful things with.

scala> val Name = """(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
Name: scala.util.matching.Regex = (Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)

scala> val Name(title, first, last) = "Mr. James Stevens"
title: String = Mr
first: String = James
last: String = Stevens

scala> val Name(title, first, last) = "Ms. Sally Kenton"
title: String = Ms
first: String = Sally
last: String = Kenton

Notice the similarity with pattern matching on types like Array and List.

scala> val Array(title, first, last) = "Mr. James Stevens".split(" ")
title: java.lang.String = Mr.
first: java.lang.String = James
last: java.lang.String = Stevens

scala> val List(title, first, last) = "Mr. James Stevens".split(" ").toList
title: java.lang.String = Mr.
first: java.lang.String = James
last: java.lang.String = Stevens

Of course, notice that here the “.” was captured, while the regex excised it. A more substantive difference with the regular expression is that it only accepts strings with the right form and will reject others, unlike simple splitting and matching to Array.

scala> val Array(title, first, last) = "221B Baker Street".split(" ")
title: java.lang.String = 221B
first: java.lang.String = Baker
last: java.lang.String = Street

scala> val Name(title, first, last) = "221B Baker Street"
scala.MatchError: 221B Baker Street (of class java.lang.String)
at .<init>(<console>:12)
at .<clinit>(<console>)
at .<init>(<console>:11)
at .<clinit>(<console>)
at $export(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:592)
at scala.tools.nsc.interpreter.IMain$Request$$anonfun$10.apply(IMain.scala:828)
at scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
at scala.tools.nsc.io.package$$anon$2.run(package.scala:31)
at java.lang.Thread.run(Thread.java:680)

That’s a lot of complaining, of course, but in practice you will generally either (a) be absolutely sure that your strings are in the correct format, (b) check for such possible exceptions, or (c) use the regex as one option of many in a match expression.
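As a sketch of option (c), the regex can be one case in a match expression with a catch-all case after it. The helper name parseName is hypothetical, and it uses Option (Some for a successful parse, None otherwise), which these tutorials have not introduced at this point, so treat it as a preview rather than the recommended approach.

```scala
val Name = """(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r

// Hypothetical helper: Some((title, first, last)) if the whole string
// is a name in the expected format, None for anything else.
def parseName(s: String): Option[(String, String, String)] = s match {
  case Name(title, first, last) => Some((title, first, last))
  case _ => None
}

val good = parseName("Mr. James Stevens")
val bad = parseName("221B Baker Street")
```

Because the second case catches everything the regex rejects, no MatchError is thrown for “221B Baker Street”.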

For now, let’s assume the input is appropriate. This means we can easily convert a list of names as strings into a list of tuples using map and a match expression.

scala> val names = List("Mr. James Stevens", "Ms. Sally Kenton", "Mrs. Jane Doe", "Mr. John Doe", "Mr. James Smith")
names: List[java.lang.String] = List(Mr. James Stevens, Ms. Sally Kenton, Mrs. Jane Doe, Mr. John Doe, Mr. James Smith)

scala> names.map(x => x match { case Name(title, first, last) => (title, first, last) })
res11: List[(String, String, String)] = List((Mr,James,Stevens), (Ms,Sally,Kenton), (Mrs,Jane,Doe), (Mr,John,Doe), (Mr,James,Smith))

Note the crucial use of groups in the Name regex: the number of groups equals the number of variables being initialized in the match. The first group is needed for the alternatives Mr, Mrs, and Ms. Without the other groups, we get an error. (From here on, I’ll shorten the MatchError output.)

scala> val NameOneGroup = """(Mr|Mrs|Ms)\. [A-Z][a-z]+ [A-Z][a-z]+""".r
NameOneGroup: scala.util.matching.Regex = (Mr|Mrs|Ms)\. [A-Z][a-z]+ [A-Z][a-z]+

scala> val NameOneGroup(title, first, last) = "Mr. James Stevens"
scala.MatchError: Mr. James Stevens (of class java.lang.String)

Of course, we can still match to the first group.

scala> val NameOneGroup(title) = "Mr. James Stevens"
title: String = Mr

What if we go in the other direction, creating more groups so that we can, for example, share the “M” in the various titles? Here’s an attempt.

scala> val NameShareM = """(M(r|rs|s))\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
NameShareM: scala.util.matching.Regex = (M(r|rs|s))\. ([A-Z][a-z]+) ([A-Z][a-z]+)

scala> val NameShareM(title, first, last) = "Mr. James Stevens"
scala.MatchError: Mr. James Stevens (of class java.lang.String)

What happened is that a new group was created, so there are now four groups to match.

scala> val NameShareM(title, titleEnding, first, last) = "Mr. James Stevens"
title: String = Mr
titleEnding: String = r
first: String = James
last: String = Stevens

scala> val NameShareM(title, titleEnding, first, last) = "Mrs. Sally Kenton"
title: String = Mrs
titleEnding: String = rs
first: String = Sally
last: String = Kenton

So, nested groups are captured as well. To stop the (r|rs|s) part from creating a match group while still being able to use it to group alternatives in a disjunction, start the group with ?: to make it non-capturing.

scala> val NameShareMThreeGroups = """(M(?:r|rs|s))\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
NameShareMThreeGroups: scala.util.matching.Regex = (M(?:r|rs|s))\. ([A-Z][a-z]+) ([A-Z][a-z]+)

scala> val NameShareMThreeGroups(title, first, last) = "Mr. James Stevens"
title: String = Mr
first: String = James
last: String = Stevens

By this point, sharing the M hasn’t saved anything over (Mr|Mrs|Ms), but there are plenty of situations where this is quite useful.
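When a pattern binding fails with a MatchError like the ones above, it can help to check how many capturing groups a regex actually defines. Here is a small sketch that goes through the underlying Java classes; the helper name groupCount is hypothetical, though the groupCount method it calls is part of java.util.regex.Matcher.

```scala
import scala.util.matching.Regex

val NameShareM: Regex = """(M(r|rs|s))\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
val NameShareMThreeGroups: Regex = """(M(?:r|rs|s))\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r

// Hypothetical helper: number of capturing groups in the pattern.
// The Matcher reports this without needing an actual match.
def groupCount(regex: Regex): Int =
  regex.pattern.matcher("").groupCount()

val withNestedGroup = groupCount(NameShareM)           // nested (r|rs|s) is counted
val withNonCapturing = groupCount(NameShareMThreeGroups) // (?:r|rs|s) is not
```

The number reported is the number of variables the pattern needs on the left-hand side.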

We can also use regex backreferences. Say we want to match names like “Mr. John Bohn”, “Mr. Joe Doe”, and “Mrs. Jill Hill”.

scala> val RhymeName = """(Mr|Mrs|Ms)\. ([A-Z])([a-z]+) ([A-Z])\3""".r
RhymeName: scala.util.matching.Regex = (Mr|Mrs|Ms)\. ([A-Z])([a-z]+) ([A-Z])\3

scala> val RhymeName(title, firstInitial, firstRest, lastInitial) = "Mr. John Bohn"
title: String = Mr
firstInitial: String = J
firstRest: String = ohn
lastInitial: String = B

Then we could piece things together to get the names we wanted.

scala> val first = firstInitial+firstRest
first: java.lang.String = John

scala> val last = lastInitial+firstRest
last: java.lang.String = Bohn

But we can do better by using an embedded group and just throwing its match result away with the underscore _.

scala> val RhymeName2 = """(Mr|Mrs|Ms)\. ([A-Z]([a-z]+)) ([A-Z]\3)""".r
RhymeName2: scala.util.matching.Regex = (Mr|Mrs|Ms)\. ([A-Z]([a-z]+)) ([A-Z]\3)

scala> val RhymeName2(title, first, _, last) = "Mr. John Bohn"
title: String = Mr
first: String = John
last: String = Bohn

Note: we can’t use the ?: operator with ([a-z]+) to stop the match because we need exactly that string to match with the \3 later.

Using regexes for assignment via pattern matching requires a full string match.

scala> val Name(title, first, last) = "Mr. James Stevens"
title: String = Mr
first: String = James
last: String = Stevens

scala> val Name(title, first, last) = "Mr. James Stevens walked to the door."
scala.MatchError: Mr. James Stevens walked to the door. (of class java.lang.String)

This is a crucial aspect of using them in match expressions. Consider an application that needs to be able to parse telephone numbers in different formats, like (123)555-5555 and 123-555-5555. Here are regexes for these two patterns and their use to parse these numbers.

scala> val Phone1 = """\((\d{3})\)\s*(\d{3})-(\d{4})""".r
Phone1: scala.util.matching.Regex = \((\d{3})\)\s*(\d{3})-(\d{4})

scala> val Phone2 = """(\d{3})-(\d{3})-(\d{4})""".r
Phone2: scala.util.matching.Regex = (\d{3})-(\d{3})-(\d{4})

scala> val Phone1(area, first3, last4) = "(123) 555-5555"
area: String = 123
first3: String = 555
last4: String = 5555

scala> val Phone2(area, first3, last4) = "123-555-5555"
area: String = 123
first3: String = 555
last4: String = 5555

We could of course use a single regular expression, but we’ll go with these two so that they can be used as separate case statements in a match expression that is part of a function that takes a string representation of a phone number and returns a tuple of three strings (thus normalizing the numbers).

def normalizePhoneNumber (number: String) = number match {
  case Phone1(x,y,z) => (x,y,z)
  case Phone2(x,y,z) => (x,y,z)
}

The action being taken for each match is just to package the separate values up in a Tuple3 — more interesting things could be done if one were looking for country codes, dealing with multiple countries, etc. The point here is to see how the regular expressions are used for the cases to capture values and assign them to local variables, each time appropriate for the form of the string that is brought in. (We’ll see in a later tutorial how to protect such a method from inputs that are not phone numbers and such.)

Now that we have that function, we can easily apply it to a list of strings representing phone numbers and filter out just those in a specific area, for example.

scala> val numbers = List("(123) 555-5555", "123-555-5555", "(321) 555-0000")
numbers: List[java.lang.String] = List((123) 555-5555, 123-555-5555, (321) 555-0000)

scala> numbers.map(normalizePhoneNumber)
res16: List[(String, String, String)] = List((123,555,5555), (123,555,5555), (321,555,0000))

scala> numbers.map(normalizePhoneNumber).filter(n => n._1=="123")
res17: List[(String, String, String)] = List((123,555,5555), (123,555,5555))
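For comparison, here is a sketch of the single-regex alternative mentioned earlier. The names Phone and normalizeWithOneRegex are made up for illustration, and the pattern is deliberately loose: it makes the parentheses optional independently of each other, so it would also accept odd inputs such as “(123-555-5555”. That looseness is one reason to prefer the two separate regexes.

```scala
// Optional "(", area code, optional ")", then any mix of spaces and hyphens.
val Phone = """\(?(\d{3})\)?[\s-]*(\d{3})-(\d{4})""".r

// Hypothetical single-case version of normalizePhoneNumber.
// Like the original, it throws a MatchError on anything that does not match.
def normalizeWithOneRegex(number: String): (String, String, String) = number match {
  case Phone(area, first3, last4) => (area, first3, last4)
}

val normalized =
  List("(123) 555-5555", "123-555-5555", "(321) 555-0000").map(normalizeWithOneRegex)
```

With one regex per format, each case can also do something format-specific, which the combined pattern cannot.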

Building Regexes from Strings

Sometimes one wants to build up a regex from smaller component parts, for example, defining what a noun phrase is and then searching for a sequence of noun phrases. To do this, we first must see the longer form of creating a regex.

scala> val AmBn = new scala.util.matching.Regex("a+b+")
AmBn: scala.util.matching.Regex = a+b+

This is the first time in these tutorials that we are explicitly creating an object using the reserved word new. We’ll be covering objects in more detail later, but what you need to know now is that Scala has a great deal of functionality that is not available by default. Mostly, we’ve been working with things like Strings, Ints, Doubles, Lists, and so on — and for the most part it has appeared to you as though they are “just” Strings, Ints, Doubles, and Lists. However, that is not the case: actually they are fully specified as:

  • java.lang.String
  • scala.Int
  • scala.Double
  • scala.List

And, in the case of the last one, scala.List is a type that is actually backed by a concrete implementation in scala.collection.immutable.List. So, when you just see “List”, Scala is actually hiding some detail; most importantly, it makes it possible to use extremely common types with very little fuss.

What scala.util.matching.Regex is telling you is that the Regex class is part of the scala.util.matching package (and that scala.util.matching is a subpackage of scala.util, which itself is a subpackage of the scala package). Fortunately, you don’t need to type out scala.util.matching every time you want to use Regex: just use an import statement, and then use Regex without the extra package specification.

scala> import scala.util.matching.Regex
import scala.util.matching.Regex

scala> val AmBn = new Regex("a+b+")
AmBn: scala.util.matching.Regex = a+b+

The other thing to explain is the new part. Again, we’ll cover this in more detail later, but for now think about it the following way. The Regex class is like a factory for producing regex objects, and the way you request (order) one of those objects is to say “new Regex(…)”, where the … indicates the string that should be used to define the properties of that object. You’ve actually been doing this quite a lot already when creating Lists, Ints, and Doubles, but again, for those core types, Scala has provided special syntax to simplify their creation and use.

Okay, but why would one want to use new Regex("a+b+") when "a+b+".r can be used to do the same? Here’s why: the latter needs to be given a complete string, but the former can be built up from several String variables. As an example, say you want a regex that matches strings of the form “the/a dog/cat/mouse/bird chased/ate the/a dog/cat/mouse/bird” such as “the dog chased the cat” and “a cat chased the bird.” The following might be the first attempt.

scala> val Transitive = "(a|the) (dog|cat|mouse|bird) (chased|ate) (a|the) (dog|cat|mouse|bird)".r
Transitive: scala.util.matching.Regex = (a|the) (dog|cat|mouse|bird) (chased|ate) (a|the) (dog|cat|mouse|bird)

This works, but we can also build it without repeating the same expression twice by using a variable that contains a String defining a regular expression (but which is not a Regex object itself) and building the regex with that.

scala> val nounPhrase = "(a|the) (dog|cat|mouse|bird)"
nounPhrase: java.lang.String = (a|the) (dog|cat|mouse|bird)

scala> val Transitive = new Regex(nounPhrase + " (chased|ate) " + nounPhrase)
Transitive: scala.util.matching.Regex = (a|the) (dog|cat|mouse|bird) (chased|ate) (a|the) (dog|cat|mouse|bird)

UPDATE: Actually, you can do this with .r rather than new Regex(…).


scala> val Transitive = (nounPhrase + " (chased|ate) " + nounPhrase).r
Transitive: scala.util.matching.Regex = (a|the) (dog|cat|mouse|bird) (chased|ate) (a|the) (dog|cat|mouse|bird)
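Taking this one step further, the alternatives themselves can come from a List. This is a sketch under the assumption that the words contain no regex meta-characters; the name nouns is made up, and mkString joins the words with | between them.

```scala
// The vocabulary lives in ordinary data rather than in the regex string.
val nouns = List("dog", "cat", "mouse", "bird")

// Builds the string "(a|the) (dog|cat|mouse|bird)".
val nounPhrase = "(a|the) (" + nouns.mkString("|") + ")"

val Transitive = (nounPhrase + " (chased|ate) " + nounPhrase).r

// Five groups in total; the two determiners are ignored with _.
val parsed = "the dog chased the cat" match {
  case Transitive(_, subject, verb, _, obj) => (subject, verb, obj)
}
```

Adding a new animal then only requires changing the list.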

The next tutorial will show how to use the scala.util.matching package API to do more extensive matching with regular expressions, such as finding multiple matches and performing substitutions.

Copyright 2011 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

Topics: iteration, for expressions, yield, map, filter, count

Preface

This is part 4 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

This tutorial departs from the very beginner nature of the previous three, so this may be of more interest to readers who already have some programming experience in another language. (Though also, see the section on using matching in Scala in Part 3.)

Iteration, the Scala way(s)

Up to now, we have (mostly) accessed individual items on a list by using their indices. But one of the most natural things to do with a list is to repeat some action for each item on the list, for example: “For each word in the given list of words: print it”. Here is how to say this in Scala.

scala> val animals = List("newt", "armadillo", "cat", "guppy")
animals: List[java.lang.String] = List(newt, armadillo, cat, guppy)

scala> animals.foreach(println)
newt
armadillo
cat
guppy

This says to take each element of the list (indicated by foreach) and apply a function (in this case, println) to it, in order. There is some underspecification going on in that we aren’t providing a variable to name elements. This works in some cases, such as above, but won’t always be possible. Here is how it looks in full, with a variable naming the element.

scala> animals.foreach(animal => println(animal))
newt
armadillo
cat
guppy

This is useful when you need to do a bit more, such as concatenating a String element with another String.

scala> animals.foreach(animal => println("She turned me into a " + animal))
She turned me into a newt
She turned me into a armadillo
She turned me into a cat
She turned me into a guppy

Or, if you are performing a computation with it, like outputting the length of each element in a list of strings.

scala> animals.foreach(animal => println(animal.length))
4
9
3
5

We can obtain the same result as foreach using a for expression.

scala> for (animal <- animals) println(animal.length)
4
9
3
5

With what we have been doing so far, these two ways of expressing the pattern of iterating over the elements of a List are equivalent. However, they differ in what they can do: a for expression can return a value (when used with yield, as shown next), whereas foreach simply performs some function on every element of the list. This latter kind of use is termed a side-effect: by printing out each element, we are not creating new values, we are just performing an action on each element. With for expressions, we can yield values that create transformed Lists. For example, contrast using println with the following.

scala> val lengths = for (animal <- animals) yield animal.length
lengths: List[Int] = List(4, 9, 3, 5)

The result is a new list that contains the lengths (number of characters) of each of the elements of the animals list. (You can of course print its contents now by doing lengths.foreach(println), but typically we want to do other, usually more interesting, things with such values.)

What we just did was map the values of animals into a new set of values in a one-to-one manner, using the function length. Lists have another function called map that does this directly.

scala> val lengthsMapped = animals.map(animal => animal.length)
lengthsMapped: List[Int] = List(4, 9, 3, 5)

So, the for-yield expression and the map method achieve the same output, and in many cases they are pretty much equivalent. Using map, however, is often more convenient because you can easily chain a series of operations together. For example, let’s say you want to add 1 to a List of numbers and then get the square of that, so turning List(1,2,3) into List(2,3,4) into List(4,9,16). You can do that quite easily using map.

val nums = List(1,2,3)
nums.map(x=>x+1).map(x=>x*x)

Some readers will be puzzled by what was just done. Here it is more explicitly, using an intermediate variable nums2 to store the add-one list.

scala> val nums2 = nums.map(x=>x+1)
nums2: List[Int] = List(2, 3, 4)

scala> nums2.map(x=>x*x)
res9: List[Int] = List(4, 9, 16)

Since nums.map(x=>x+1) returns a List, we don’t have to assign it to a variable to use it; we can just use it immediately, including calling map on it again. (Of course, one could do this computation in a single go, e.g. map(x=>(x+1)*(x+1)), but often one is using a series of built-in functions, or functions one has already defined.)
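To make the comparison concrete, here is a short sketch that computes the same list three ways. The variable names chained, singlePass, and viaFor are made up for illustration.

```scala
val nums = List(1, 2, 3)

// Two chained maps: add one, then square.
val chained = nums.map(x => x + 1).map(x => x * x)

// The same computation done in a single map.
val singlePass = nums.map(x => (x + 1) * (x + 1))

// And once more as a for-yield expression.
val viaFor = for (x <- nums) yield (x + 1) * (x + 1)
```

All three produce List(4, 9, 16); which one to use is a matter of readability.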

You can keep on mapping to your heart’s content, including mapping from Ints to Strings.

scala> nums.map(x=>x+1).map(x=>x*x).map(x=>x-1).map(x=>x*(-1)).map(x=>"The answer is: " + x)
res12: List[java.lang.String] = List(The answer is: -3, The answer is: -8, The answer is: -15)

Note: the use of x in all these cases is not important. They could have been named x, y, z and turlingdromes42 — any valid variable name.

Iterating through multiple lists

Sometimes you have two lists that are paired up and you need to do something to elements from each list simultaneously. For example, let’s say you have a list of word tokens and another list with their parts-of-speech. (See the previous tutorial for discussion of parts-of-speech.)

scala> val tokens = List("the", "program", "halted")
tokens: List[java.lang.String] = List(the, program, halted)

scala> val tags = List("DT","NN","VB")
tags: List[java.lang.String] = List(DT, NN, VB)

Now, let’s say we want to output these as the following string:

the/DT program/NN halted/VB

Initially, we’ll do it a step at a time, and then show how it can be done all in one line.

First, we use the zip function to bring two lists together and get a new list of pairs of elements from each list.

scala> val tokenTagPairs = tokens.zip(tags)
tokenTagPairs: List[(java.lang.String, java.lang.String)] = List((the,DT), (program,NN), (halted,VB))

Zipping two lists together in this way is a common pattern used for iterating over two lists.

Now that we have a list of token-tag pairs, we can use a for expression to turn it into a List of strings.

scala> val tokenTagSlashStrings = for ((token, tag) <- tokenTagPairs) yield token + "/" + tag
tokenTagSlashStrings: List[java.lang.String] = List(the/DT, program/NN, halted/VB)

Now we just need to turn that list of strings into a single string by concatenating all its elements with a space between each. The function mkString makes this easy.

scala> tokenTagSlashStrings.mkString(" ")
res19: String = the/DT program/NN halted/VB

Finally, here it all is in one step.

scala> (for ((token, tag) <- tokens.zip(tags)) yield token + "/" + tag).mkString(" ")
res23: String = the/DT program/NN halted/VB
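The same one-liner can also be written with map instead of a for expression. The sketch below assumes the tokens and tags lists defined earlier, and adds one caveat worth knowing: zip does not complain when the lists have different lengths.

```scala
val tokens = List("the", "program", "halted")
val tags = List("DT", "NN", "VB")

// zip pairs the lists up; the case pattern unpacks each pair.
val tagged = tokens.zip(tags).map { case (token, tag) => token + "/" + tag }.mkString(" ")

// If the lists differ in length, zip silently stops at the shorter one.
val truncated = tokens.zip(List("DT", "NN"))
```

Here truncated has only two pairs, so it can be worth checking that both lists have the same length before zipping.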

Ripping a string into a useful data structure

It is common in computational linguistics to need to convert string inputs into useful data structures. Consider the part-of-speech tagged sentence mentioned in the previous tutorial. Let’s begin by assigning it to the variable sentRaw.


val sentRaw = "The/DT index/NN of/IN the/DT 100/CD largest/JJS Nasdaq/NNP financial/JJ stocks/NNS rose/VBD modestly/RB as/IN well/RB ./."

Now, let’s turn it into a List of Tuples, where each Tuple has the word as its first element and the postag as its second. We begin with the single line that does this so that you can see what the desired result is, and then we’ll examine each step in detail.

scala> val tokenTagPairs = sentRaw.split(" ").toList.map(x => x.split("/")).map(x => Tuple2(x(0), x(1)))
tokenTagPairs: List[(java.lang.String, java.lang.String)] = List((The,DT), (index,NN), (of,IN), (the,DT), (100,CD), (largest,JJS), (Nasdaq,NNP), (financial,JJ), (stocks,NNS), (rose,VBD), (modestly,RB), (as,IN), (well,RB), (.,.))

Let’s take each of these in turn. The first split cuts sentRaw at each space character, and returns an Array of Strings, where each element is the material between the spaces.

scala> sentRaw.split(" ")
res0: Array[java.lang.String] = Array(The/DT, index/NN, of/IN, the/DT, 100/CD, largest/JJS, Nasdaq/NNP, financial/JJ, stocks/NNS, rose/VBD, modestly/RB, as/IN, well/RB, ./.)

What’s an Array? It’s a kind of sequence, like List, but it has some different properties that we’ll discuss later. For now, let’s stick with Lists, which we can do by using the toList method. Additionally, let’s assign it to a variable so that the remaining operations are easier to focus on.

scala> val tokenTagSlashStrings = sentRaw.split(" ").toList
tokenTagSlashStrings: List[java.lang.String] = List(The/DT, index/NN, of/IN, the/DT, 100/CD, largest/JJS, Nasdaq/NNP, financial/JJ, stocks/NNS, rose/VBD, modestly/RB, as/IN, well/RB, ./.)

Now, we need to turn each of the elements in that list into a pair of token and tag. Let’s first consider a single element, turning something like "The/DT" into the pair ("The","DT"). The next lines show how to do this one step at a time, using intermediate variables.

scala> val first = "The/DT"
first: java.lang.String = The/DT

scala> val firstSplit = first.split("/")
firstSplit: Array[java.lang.String] = Array(The, DT)

scala> val firstPair = Tuple2(firstSplit(0), firstSplit(1))
firstPair: (java.lang.String, java.lang.String) = (The,DT)

So, firstPair is a tuple representing the information encoded in the string first. This involved two operations, splitting and then creating a tuple from the Array that resulted from the split. We can do this for all of the elements in tokenTagSlashStrings using map. Let’s first convert the Strings into Arrays.

scala> val tokenTagArrays = tokenTagSlashStrings.map(x => x.split("/"))
tokenTagArrays: List[Array[java.lang.String]] = List(Array(The, DT), Array(index, NN), Array(of, IN), Array(the, DT), Array(100, CD), Array(largest, JJS), Array(Nasdaq, NNP), Array(financial, JJ), Array(stocks, NNS), Array(rose, VBD), Array(modestly, RB), Array(as, IN), Array(well, RB), Array(., .))

And finally, we turn the Arrays into Tuple2s and get the result we obtained with the one-liner earlier.

scala> val tokenTagPairs = tokenTagArrays.map(x => Tuple2(x(0), x(1)))
tokenTagPairs: List[(java.lang.String, java.lang.String)] = List((The,DT), (index,NN), (of,IN), (the,DT), (100,CD), (largest,JJS), (Nasdaq,NNP), (financial,JJ), (stocks,NNS), (rose,VBD), (modestly,RB), (as,IN), (well,RB), (.,.))

Note: if you are comfortable with using one-liners that chain a bunch of operations together, then by all means use them. However, there is no shame in using several lines involving a bunch of intermediate variables if that helps you break apart the task and get the result you need.

One very useful thing about having a List of pairs (Tuple2s) is that the unzip function gives us back two Lists, one with all of the first elements and another with all of the second elements.

scala> val (tokens, tags) = tokenTagPairs.unzip
tokens: List[java.lang.String] = List(The, index, of, the, 100, largest, Nasdaq, financial, stocks, rose, modestly, as, well, .)
tags: List[java.lang.String] = List(DT, NN, IN, DT, CD, JJS, NNP, JJ, NNS, VBD, RB, IN, RB, .)

With this, we’ve come full circle. Having started with a raw string (such as we are likely to read in from a text file), we now have Lists that allow us to do useful computations, such as converting those tags into another form.
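The steps above assume every item contains exactly one slash. As a hedged variation, the sketch below uses a hypothetical helper, splitTagged, that cuts at the last slash instead, so an item whose word itself contains a slash (the made-up "1/2/CD" here) still yields a sensible pair. The sample sentence is shortened and altered for illustration.

```scala
// Hypothetical helper: split a "word/TAG" string at its last slash,
// so that a word containing a slash (e.g. "1/2/CD") keeps it.
def splitTagged(item: String): (String, String) = {
  val cut = item.lastIndexOf("/")
  if (cut < 0) (item, "") // no slash at all: treat the tag as empty
  else (item.substring(0, cut), item.substring(cut + 1))
}

val sample = "The/DT index/NN rose/VBD 1/2/CD ./."

// Same pipeline as before, with the helper passed to map.
val (words, postags) = sample.split(" ").toList.map(splitTagged).unzip
```

With x.split("/") and x(0), x(1), the item "1/2/CD" would have produced the pair (1,2) and lost the tag.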

Providing a function you have defined to map

Let’s return to the postag simplification exercise we did in the previous tutorial. We’ll modify it a bit: rather than shortening the Penn Treebank parts-of-speech, let’s convert them to coarse parts-of-speech using the English words that most people are familiar with, like noun and verb. The following function turns Penn Treebank tags into these coarse tags, for more tags than we covered in the last tutorial (note: this is still incomplete, but serves to illustrate the point).

def coursePos (tag: String) = tag match {
  case "NN" | "NNS" | "NNP" | "NNPS"                       => "Noun"
  case "JJ" | "JJR" | "JJS"                                => "Adjective"
  case "VB" | "VBD" | "VBG" | "VBN" | "VBP" | "VBZ" | "MD" => "Verb"
  case "RB" | "RBR" | "RBS" | "WRB" | "EX"                 => "Adverb"
  case "PRP" | "PRP$" | "WP" | "WP$"                       => "Pronoun"
  case "DT" | "PDT" | "WDT"                                => "Article"
  case "CC"                                                => "Conjunction"
  case "IN" | "TO"                                         => "Preposition"
  case _                                                   => "Other"
}

We can now map this function over the parts of speech in the collection obtained previously.

scala> tags.map(coursePos)
res1: List[java.lang.String] = List(Article, Noun, Preposition, Article, Other, Adjective, Noun, Adjective, Noun, Verb, Adverb, Preposition, Adverb, Other)

Voila! If we want to convert the tags in this manner and then output them as a string like what we started with, it’s just a few steps. We’ll start from the beginning and recap. Try running the following for yourself.

val sentRaw = "The/DT index/NN of/IN the/DT 100/CD largest/JJS Nasdaq/NNP financial/JJ stocks/NNS rose/VBD modestly/RB as/IN well/RB ./."

val (tokens, tags) = sentRaw.split(" ").toList.map(x => x.split("/")).map(x => Tuple2(x(0), x(1))).unzip

tokens.zip(tags.map(coursePos)).map(x => x._1+"/"+x._2).mkString(" ")

A further point is that when you provide expressions like (x => x+1) to map, you are actually defining an anonymous function! Here is the same map operation with different levels of specification:


scala> val numbers = (1 to 5).toList
numbers: List[Int] = List(1, 2, 3, 4, 5)

scala> numbers.map(1+)
res11: List[Int] = List(2, 3, 4, 5, 6)

scala> numbers.map(_+1)
res12: List[Int] = List(2, 3, 4, 5, 6)

scala> numbers.map(x=>x+1)
res13: List[Int] = List(2, 3, 4, 5, 6)

scala> numbers.map((x: Int) => x+1)
res14: List[Int] = List(2, 3, 4, 5, 6)

So, it’s all consistent: whether you pass in a named function or an anonymous function, map will apply it to each element in the list.

Finally, note that you can use that final form to define a function.


scala> def addOne = (x: Int) => x + 1
addOne: (Int) => Int

scala> addOne(1)
res15: Int = 2

This is similar to defining functions as we had previously (e.g. def addOne (x: Int) = x+1), but it is more convenient in certain contexts, which we’ll get to later. For now, the thing to realize is that whenever you map, you are either using a function that already existed or creating one on the fly.
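To see the named and anonymous options side by side, here is a small sketch. The names addOneMethod and addOneValue are invented for this example; the point is only that map accepts any of them.

```scala
val numbers = (1 to 5).toList

// A function defined with def and a parameter list, as in earlier tutorials.
def addOneMethod(x: Int): Int = x + 1

// A function stored in a val, like any other value.
val addOneValue: Int => Int = x => x + 1

// map accepts either one, or an anonymous function written in place.
val viaMethod = numbers.map(addOneMethod)
val viaValue = numbers.map(addOneValue)
val viaAnonymous = numbers.map(x => x + 1)
```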

Filtering and counting

The map method is a convenient way of performing computations on each element of a List, effectively transforming a List from one set of values to a new List with a set of values computed from each corresponding element. There are yet more methods that have other actions, such as removing elements from a List (filter), counting the number of elements satisfying a given predicate (count), and computing an aggregate single result from all elements in a List (reduce and fold). Let’s consider a simple task: count how many tokens are not a noun or adjective in a tagged sentence. As a starting point, let’s take the list of mapped postags from before.

scala> val courseTags = tags.map(coursePos)
courseTags: List[java.lang.String] = List(Article, Noun, Preposition, Article, Other, Adjective, Noun, Adjective, Noun, Verb, Adverb, Preposition, Adverb, Other)

One way of doing this is to filter out all of the nouns and adjectives to obtain a list without them and then get its length.

scala> val noNouns = courseTags.filter(x => x != "Noun")
noNouns: List[java.lang.String] = List(Article, Preposition, Article, Other, Adjective, Adjective, Verb, Adverb, Preposition, Adverb, Other)

scala> val noNounsOrAdjectives = noNouns.filter(x => x != "Adjective")
noNounsOrAdjectives: List[java.lang.String] = List(Article, Preposition, Article, Other, Verb, Adverb, Preposition, Adverb, Other)

scala> noNounsOrAdjectives.length
res8: Int = 9

However, because filter takes a function that returns a Boolean value, we can of course use Boolean conjunction and disjunction to simplify things. And we don’t need to save intermediate variables. Here’s the one-liner.

scala> courseTags.filter(x => x != "Noun" && x != "Adjective").length
res9: Int = 9

If all we want is the number of elements, we can instead just use count with the same predicate.

scala> courseTags.count(x => x != "Noun" && x != "Adjective")
res10: Int = 9
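When the list of labels to leave out grows, writing a long chain of != tests gets tedious. One possible alternative, sketched below, keeps the unwanted labels in a Set (the name excluded is made up here). A Set can be used as a function that returns true for its members, so it can be handed directly to filterNot, which keeps the elements for which the test is false.

```scala
// Coarse tags for the example sentence, copied from the output above.
val courseTags = List("Article", "Noun", "Preposition", "Article", "Other",
  "Adjective", "Noun", "Adjective", "Noun", "Verb", "Adverb", "Preposition",
  "Adverb", "Other")

// The labels to leave out; excluded(x) is true when x is in the Set.
val excluded = Set("Noun", "Adjective")

val viaFilter = courseTags.filter(x => !excluded(x)).length
val viaFilterNot = courseTags.filterNot(excluded).length
val viaCount = courseTags.count(x => !excluded(x))
```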

As an exercise, try doing a one-liner that starts with sentRaw and provides the value “resX: Int = 9” (where X is whatever you get in your Scala REPL).

In the next tutorial, we’ll see how to use reduce and fold to compute aggregate results from a List.

Copyright 2011 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

Topics: conditional execution with if-else blocks and matching

Preface

This is part 3 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

Conditionals

Variables come and variables go, and they take on different values depending on the input. We typically need to enact different behaviors conditioned on those values. For example, let’s simulate a bartender in Austin who must make sure that he doesn’t give alcohol to individuals under 21 years of age.

scala> def serveBeer (customerAge: Int) = if (customerAge >= 21) println("beer") else println("water")
serveBeer: (customerAge: Int)Unit

scala> serveBeer(23)
beer

scala> serveBeer(19)
water

What we’ve done here is a standard use of conditionals to produce one action or another — in this case just printing one message or another. The expression in the if (…) is a Boolean value, either true or false. You can see this by just doing the inequality directly:

scala> 19 >= 21
res7: Boolean = false

And these expressions can be combined according to the standard rules for conjunction and disjunction of Booleans. Conjunction is indicated with && and disjunction with ||.

scala> 19 >= 21 || 5 > 2
res8: Boolean = true

scala> 19 >= 21 && 5 > 2
res9: Boolean = false

To check equality, use ==.

scala> 42 == 42
res10: Boolean = true

scala> "the" == "the"
res11: Boolean = true

scala> 3.14 == 6.28
res12: Boolean = false

scala> 2*3.14 == 6.28
res13: Boolean = true

scala> "there" == "the" + "re"
res14: Boolean = true
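A word of caution about == on decimal numbers: the 2*3.14 example works out, but not every such comparison does, because Doubles are stored in binary and most decimal fractions are only approximated. The sketch below shows one case that fails and a common workaround. The helper closeEnough and its tolerance of 1e-9 are arbitrary choices made for this illustration.

```scala
// Doubling is exact in binary floating point, so this holds.
val doubled = 2 * 3.14 == 6.28

// 0.1, 0.2 and 0.3 have no exact binary form, so this does not.
val summed = 0.1 + 0.2 == 0.3

// A common workaround: compare within a small tolerance.
// The default tolerance here is an arbitrary choice for this sketch.
def closeEnough(a: Double, b: Double, tol: Double = 1e-9): Boolean =
  math.abs(a - b) < tol

val summedClose = closeEnough(0.1 + 0.2, 0.3)
```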

The equality operator == is different from the assignment operator =, and you’ll get an error if you attempt to use = for equality tests.

scala> 5 = 5
<console>:1: error: ';' expected but '=' found.
5 = 5
^

scala> x = 5
<console>:10: error: not found: value x
val synthvar$0 = x
^
<console>:7: error: not found: value x
x = 5
^

The first example fails outright because we cannot assign a value to a constant like 5. In the second example, the error complains about not finding a value x. The statement itself is a valid construct, but only if a var variable x has been previously defined.

scala> var x = 0
x: Int = 0

scala> x = 5
x: Int = 5

Recall that with var variables, it is possible to assign them a new value. However, it is actually not necessary to use vars much of the time, and there are many advantages to sticking with vals. I’ll be helping you think in these terms as we go along. For now, try to ignore the fact that vars exist in the language!

Back to conditionals. First, here are more comparison operators:

x == y   (x is equal to y)
x != y   (x does not equal y)
x > y    (x is larger than y)
x < y    (x is less than y)
x >= y   (x is equal to y, or larger than y)
x <= y   (x is equal to y, or less than y)

These operators work on any type that has a natural ordering, including Strings.

scala> "armadillo" < "bear"
res25: Boolean = true

scala> "armadillo" < "Bear"
res26: Boolean = false

scala> "Armadillo" < "Bear"
res27: Boolean = true

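Clearly, this isn’t the usual alphabetic ordering you are used to. Instead, it is based on the numeric codes of the characters (ASCII order for these letters), in which every uppercase letter comes before every lowercase letter.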
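If you do want the familiar dictionary-style order, one simple approach is to lowercase both sides before comparing. The sketch below shows this for a single comparison and for sorting a list; the list contents are made up for the example.

```scala
// Default comparison: "B" has a smaller character code than "a".
val rawOrder = "armadillo" < "Bear"

// Lowercasing both sides first gives the familiar alphabetic order.
val foldedOrder = "armadillo".toLowerCase < "Bear".toLowerCase

// The same idea applied to sorting a list.
val animals = List("cat", "armadillo", "Bear")
val sortedRaw = animals.sorted
val sortedFolded = animals.sortBy(x => x.toLowerCase)
```

Note that sortBy only uses the lowercased form for comparing; the strings in the result keep their original capitalization.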

A very beautiful and useful thing about conditionals in Scala is that they return a value. So, the following is a valid way to set the values of the variables x and y.

scala> val x = if (true) 1 else 0
x: Int = 1

scala> val y = if (false) 1 else 0
y: Int = 0

Not so impressive here, but let’s return to the bartender, and rather than the serveBeer function printing a String, we can have it return a String representing a beverage, “beer” in the case of a 21+ year old and “water” otherwise.

scala> def serveBeer (customerAge: Int) = if (customerAge >= 21) "beer" else "water"
serveBeer: (customerAge: Int)java.lang.String

scala> serveBeer(42)
res21: java.lang.String = beer

scala> serveBeer(20)
res22: java.lang.String = water

Notice how the first serveBeer function returned Unit but this one returns a String. Unit means that no value is returned — in general this is to be discouraged for reasons we won’t get into here. Regardless of that, the general pattern of conditional assignment shown above is something you’ll be using a lot.

Conditionals can also have more than just the single if and else. For example, let’s say that the bartender simply serves age-appropriate drinks to each customer: customers 21 and over get beer, teenagers get soda, and little kids get juice.

scala> def serveDrink (customerAge: Int) = {
|     if (customerAge >= 21) "beer"
|     else if (customerAge >= 13) "soda"
|     else "juice"
| }
serveDrink: (customerAge: Int)java.lang.String

scala> serveDrink(42)
res35: java.lang.String = beer

scala> serveDrink(16)
res36: java.lang.String = soda

scala> serveDrink(6)
res37: java.lang.String = juice

And of course, the Boolean expressions in any of the ifs or else ifs can be complex conjunctions and disjunctions of smaller expressions. Let’s consider a computational linguistics oriented example now that can take advantage of that, and which we will continue to build on in later tutorials.

Everybody (hopefully) knows what a part-of-speech is. (If not, go check out Grammar Rock on YouTube.) In computational linguistics, we tend to use very detailed tagsets that go far beyond “noun”, “verb”, “adjective” and so on. For example, the tagset from the Penn Treebank uses NN for singular nouns (table), NNS for plural nouns (tables), NNP for singular proper noun (John), and NNPS for plural proper noun (Vikings).

Here’s an annotated sentence with postags from the first sentence of the Wall Street Journal portion of the Penn Treebank, in the format word/postag.

The/DT index/NN of/IN the/DT 100/CD largest/JJS Nasdaq/NNP financial/JJ stocks/NNS rose/VBD modestly/RB as/IN well/RB ./.

We’ll see how to process these en masse shortly, but for now, let’s build a function that turns single tags like “NNP” into “NN” and “JJS” into “JJ”, using conditionals. We’ll let all the other postags stay as they are.

We’ll start with a suboptimal solution, and then refine it. The first thing you might try is to create a case for every full form tag and output its corresponding shortened tag.

scala> def shortenPos (tag: String) = {
|     if (tag == "NN") "NN"
|     else if (tag == "NNS") "NN"
|     else if (tag == "NNP") "NN"
|     else if (tag == "NNPS") "NN"
|     else if (tag == "JJ") "JJ"
|     else if (tag == "JJR") "JJ"
|     else if (tag == "JJS") "JJ"
|     else tag
| }
shortenPos: (tag: String)java.lang.String

scala> shortenPos("NNP")
res47: java.lang.String = NN

scala> shortenPos("JJS")
res48: java.lang.String = JJ

So, it’s doing the job, but there is a lot of redundancy — in particular, the return value is the same for many cases. We can use disjunctions to deal with this.

def shortenPos2 (tag: String) = {
  if (tag == "NN" || tag == "NNS" || tag == "NNP" || tag == "NNPS") "NN"
  else if (tag == "JJ" || tag == "JJR" || tag == "JJS") "JJ"
  else tag
}

These are logically equivalent.

There is an easier way of doing this, using properties of Strings. Here, the startsWith method is very useful.

scala> "NNP".startsWith("NN")
res51: Boolean = true

scala> "NNP".startsWith("VB")
res52: Boolean = false

We can use this to simplify the postag shortening function.

def shortenPos3 (tag: String) = {
  if (tag.startsWith("NN")) "NN"
  else if (tag.startsWith("JJ")) "JJ"
  else tag
}

This makes it very easy to add an additional condition that collapses all of the verb tags to “VB”. (Left as an exercise.)

A final note on conditional assignments: they can return anything you like, including Tuples. For example, here is a (very) simple (and very imperfect) English stemmer that returns the stem and suffix as a pair.

scala> def splitWord (word: String) = {
|     if (word.endsWith("ing")) (word.slice(0,word.length-3), "ing")
|     else if (word.endsWith("ed")) (word.slice(0,word.length-2), "ed")
|     else if (word.endsWith("er")) (word.slice(0,word.length-2), "er")
|     else if (word.endsWith("s")) (word.slice(0,word.length-1), "s")
|     else (word,"")
| }
splitWord: (word: String)(String, java.lang.String)

scala> splitWord("walked")
res10: (String, java.lang.String) = (walk,ed)

scala> splitWord("walking")
res11: (String, java.lang.String) = (walk,ing)

scala> splitWord("booking")
res12: (String, java.lang.String) = (book,ing)

scala> splitWord("baking")
res13: (String, java.lang.String) = (bak,ing)

If we want to work with the stem and suffix directly as variables, we can assign them straight away.

scala> val (stem, suffix) = splitWord("walked")
stem: String = walk
suffix: java.lang.String = ed
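The if-else chain in splitWord repeats the same shape for every suffix. As one possible alternative (not the approach used above), the suffixes can be kept in a List and searched in order. The name splitWordList is hypothetical, and the filter method it relies on is a List method that keeps only the elements passing a test.

```scala
// Suffixes are tried in this order; the first one that fits wins.
val suffixes = List("ing", "ed", "er", "s")

def splitWordList(word: String): (String, String) = {
  // Keep only the suffixes that this word actually ends with.
  val found = suffixes.filter(s => word.endsWith(s))
  if (found.isEmpty) (word, "")
  else (word.dropRight(found.head.length), found.head)
}
```

Adding a new suffix then means adding one string to the list. The results are just as imperfect as before, e.g. splitWordList("baking") still gives (bak,ing).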

Matching

Scala provides another very powerful way to encode conditional execution called matching. Match blocks have much in common with if-else blocks, but come with some nice extra features. We’ll go back to the postag shortener, starting by listing out all of the tags and what to do in each case, like our first attempt with if-else.

def shortenPosMatch (tag: String) = tag match {
  case "NN" => "NN"
  case "NNS" => "NN"
  case "NNP" => "NN"
  case "NNPS" => "NN"
  case "JJ" => "JJ"
  case "JJR" => "JJ"
  case "JJS" => "JJ"
  case _ => tag
}

scala> shortenPosMatch("JJR")
res14: java.lang.String = JJ

Note that the last case, with the underscore “_” is the default action to take, similar to the “else” at the end of an if-else block.

Compare this to the if-else function shortenPos from before, which had lots of repetition in its definition of the form “else if (tag ==”. Match statements allow you to do the same thing, but much more concisely and, arguably, much more clearly. Of course, we can shorten this up.

def shortenPosMatch2 (tag: String) = tag match {
  case "NN" | "NNS" | "NNP" | "NNPS" => "NN"
  case "JJ" | "JJR" | "JJS" => "JJ"
  case _ => tag
}

This is quite a bit more readable than the if-else shortenPos2 defined earlier.

In addition to readability, match statements provide some logical protection. For example, if you accidentally have two cases that overlap, you’ll get an error.


scala> def shortenPosMatchOops (tag: String) = tag match {
|   case "NN" | "NNS" | "NNP" | "NNPS" => "NN"
|   case "JJ" | "JJR" | "JJS" => "JJ"
|   case "NN" => "oops"
|   case _ => tag
| }
<console>:10: error: unreachable code
case "NN" => "oops"

This is an obvious example, but with more complex match options, it can save you from bugs!

We cannot use the startsWith method the same way we did with the if-else shortenPos3. However, we can use regular expressions very nicely with match statements, which we’ll get to in a later tutorial.
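Although startsWith cannot be written as a case pattern directly, Scala does allow an if condition, called a guard, after a pattern. This goes slightly beyond what the tutorial covers, so treat the following as an optional sketch; the name shortenPosGuard is made up here.

```scala
def shortenPosGuard(tag: String) = tag match {
  // t is bound to the tag; the case applies only when the guard is true.
  case t if t.startsWith("NN") => "NN"
  case t if t.startsWith("JJ") => "JJ"
  case _ => tag
}
```

This behaves like the if-else version with startsWith: guards are checked from top to bottom, and the first case whose guard holds is used.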

Where match statements really shine is that they can match on much more than just the value of simple variables like Strings and Ints. One use of matches is to check the type of the input to a function that accepts a supertype of many types. Recall that Any is the supertype of all types; if we have the following function that takes an argument of any type, we can use matching to inspect the type of the argument and behave differently accordingly.

scala> def multitypeMatch (x: Any) = x match {
|    case i: Int => "an Int: " + i*i
|    case d: Double => "a Double: " + d/2
|    case b: Boolean => "a Boolean: " + !b
|    case s: String => "a String: " + s.length
|    case (p1: String, p2: Int) => "a Tuple[String, Int]: " + p2*p2 + p1.length
|    case (p1: Any, p2: Any) => "a Tuple[Any, Any]: (" + p1 + "," + p2 + ")"
|    case _ => "some other type " + x
| }
multitypeMatch: (x: Any)java.lang.String

scala> multitypeMatch(true)
res4: java.lang.String = a Boolean: false

scala> multitypeMatch(3)
res5: java.lang.String = an Int: 9

scala> multitypeMatch((1,3))
res6: java.lang.String = a Tuple[Any, Any]: (1,3)

scala> multitypeMatch(("hi",3))
res7: java.lang.String = a Tuple[String, Int]: 92

So, for example, if it is an Int, we can do things like multiplication, if it is a Boolean we can negate it (with !), and so on. In the case statement, we provide a new variable that will have the type that is matched, and then after the arrow =>, we can use that variable in a type-safe manner. Later we’ll see how to create classes (and in particular case classes), where this sort of matching-based function is used regularly.

In the meantime, here’s an example of a simple addition function that allows one to enter a String or Int to specify its arguments. For example, the behavior we desire is this:

scala> add(1,3)
res4: Int = 4

scala> add("one",3)
res5: Int = 4

scala> add(1,"three")
res6: Int = 4

scala> add("one","three")
res7: Int = 4

Let’s assume that we only handle the spelled out versions of 1 through 5, and that any string we cannot handle (e.g. “six” and “aardvark”) is considered to be 0. Then the following two functions using matches handle it.

def convertToInt (x: String) = x match {
  case "one" => 1
  case "two" => 2
  case "three" => 3
  case "four" => 4
  case "five" => 5
  case _ => 0
}

def add (x: Any, y: Any) = (x,y) match {
  case (x: Int, y: Int) => x + y
  case (x: String, y: Int) => convertToInt(x) + y
  case (x: Int, y: String) => x + convertToInt(y)
  case (x: String, y: String) => convertToInt(x) + convertToInt(y)
  case _ => 0
}
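As a design alternative, the word-to-number lookup can live in a Map, and each argument can be converted on its own before adding. This is only a sketch: the names wordValues, convertToIntMap, toInt and addAny are invented for it.

```scala
// Lookup table for the spelled-out numbers handled in the text.
val wordValues = Map("one" -> 1, "two" -> 2, "three" -> 3, "four" -> 4, "five" -> 5)

// getOrElse supplies 0 for any string that is not in the table.
def convertToIntMap(x: String): Int = wordValues.getOrElse(x, 0)

// Convert a single argument of unknown type to an Int.
def toInt(x: Any): Int = x match {
  case i: Int => i
  case s: String => convertToIntMap(s)
  case _ => 0
}

// Convert each argument separately, then add.
def addAny(x: Any, y: Any): Int = toInt(x) + toInt(y)
```

One difference from add above: if only one argument has an unsupported type (say a Double), add returns 0 overall, whereas addAny counts just that argument as 0 and still uses the other.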

Like if-else blocks, matches can return whatever type you like, including Tuples, Lists and more.

Match blocks are used in many other useful contexts that we’ll come to later. In the meantime, it is also worth pointing out that matching is actually used in variable assignment. We’ve seen it already with Tuples, but it can be done with Lists and other types.

scala> val (x,y) = (1,2)
x: Int = 1
y: Int = 2

scala> val colors = List("blue","red","yellow")
colors: List[java.lang.String] = List(blue, red, yellow)

scala> val List(color1, color2, color3) = colors
color1: java.lang.String = blue
color2: java.lang.String = red
color3: java.lang.String = yellow

This is especially useful in the case of the args Array that comes from the command line when creating a script with Scala. For example, consider a program that is run as follows.

$ scala nextYear.scala John 35
Next year John will be 36 years old.

Here’s how we can do it. (Save the next two lines as nextYear.scala and try it out.)

val Array(name, age) = args
println("Next year " + name + " will be " + (age.toInt + 1) + " years old.")

Notice that we had to do age.toInt. That is because age itself is a String, not an Int.
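The two-line script fails with an error if it is run with the wrong number of arguments, or with an age that is not a number. A more forgiving version might look like the sketch below, which uses a match block with a guard. The function name nextYearMessage and the wording of the usage hint are assumptions made for this example; Try comes from the Scala standard library and reports whether a computation succeeded.

```scala
import scala.util.Try

// Hypothetical helper: build the message, or a usage hint when the
// arguments do not have the expected shape.
def nextYearMessage(args: Array[String]): String = args match {
  // Exactly two arguments, and the second one converts to an Int.
  case Array(name, age) if Try(age.toInt).isSuccess =>
    "Next year " + name + " will be " + (age.toInt + 1) + " years old."
  case _ =>
    "Usage: scala nextYear.scala <name> <age>"
}
```

In a script you would finish with println(nextYearMessage(args)).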

Conditional execution with if-else blocks and match blocks is a powerful part of building complex behaviors into your programs that you’ll see and use frequently!

