Topics: conventions, regexes, mapping, partitioning, vectors vs lists, overloaded constructors, case classes, traits, multiple inheritance, implicits

Preface

I’m currently teaching a course on Applied Text Analysis and am using Scala as the programming language taught and used in the course. Rather than creating more tutorials, I figured I’d take a page from Brian Dunning’s playbook on his Skeptoid podcast (highly recommended) when he takes student questions. So, I had the students in the course submit questions about Scala that they had, based on the readings and assignments thus far. This post covers over half of them; the rest will be covered in a follow-up post.

I start with some of the more basic questions, and the questions and/or answers progressively get into more intermediate level topics. Suggestions and comments to improve any of the answers are very welcome!

Basic Questions

Q. Concerning addressing parts of variables: To address individual parts of lists, the numbering of the items is (List 0,1,2 etc.). That is, the first element is called “0”. It seems to be the same for Arrays and Maps, but not for Tuples: to get the first element of a Tuple, I need to use Tuple._1. Why is that?

A. It’s just a matter of convention — tuples have used a 1-based index in other languages like Haskell, and it seems that Scala has adopted the same convention/tradition. See:

http://stackoverflow.com/questions/6241464/why-are-the-indexes-of-scala-tuples-1-based

Q. It seems that Scala doesn’t recognize the “\b” boundary character as a regular expression. Is there something similar in Scala?

A. Scala does recognize boundary characters. For example, the following REPL session declares a regex that finds “the” with boundaries, and successfully retrieves the three tokens of “the” in the example sentence.

scala> val TheRE = """\bthe\b""".r
TheRE: scala.util.matching.Regex = \bthe\b

scala> val sentence = "She think the man is a stick-in-the-mud, but the man disagrees."
sentence: java.lang.String = She think the man is a stick-in-the-mud, but the man disagrees.

scala> TheRE.findAllIn(sentence).toList
res1: List[String] = List(the, the, the)

Q. Why doesn’t the method “split” work on args? Example: val arg = args.split(" "). Args are strings right, so split should work?

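A. The args variable is an Array[String], not a String, and split is a method on Strings, so it doesn’t work on args itself. Arrays are, in effect, already split: the shell breaks the command line on whitespace before your program sees it. You can still call split on any individual element, such as args(0).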
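To make the distinction concrete, here is a small sketch. The name exampleArgs is a hypothetical stand-in for the args array a real script would receive, and the values in it are made up for illustration.

```scala
// Hypothetical stand-in for the args array passed to a script or main method.
val exampleArgs: Array[String] = Array("hello world", "foo")

// Each element is a String, so split works on an individual element...
val firstArgTokens: Array[String] = exampleArgs(0).split(" ")

// ...or on every element at once, using flatMap to merge the pieces.
val allTokens: Array[String] = exampleArgs.flatMap(_.split(" "))
```

The first argument only contains a space here because it was quoted on the (imagined) command line; unquoted arguments would already arrive as separate elements.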

Q. What is the major difference between foo.mapValues(x=>x.length) and foo.map(x=>x.length). Some places one works and one does not.

A. The map function works on all collection types, including Seqs and Maps (note that a Map can be seen as a collection of Tuple2s). The mapValues function, however, only works on Maps. It is essentially a convenience function. As an example, let’s start with a simple Map from Ints to Ints.

scala> val foo = List((1,2),(3,4)).toMap
foo: scala.collection.immutable.Map[Int,Int] = Map(1 -> 2, 3 -> 4)

Now consider the task of adding 2 to each value in the Map. This can be done with the map function as follows.

scala> foo.map { case(key,value) => (key,value+2) }
res5: scala.collection.immutable.Map[Int,Int] = Map(1 -> 4, 3 -> 6)

So, the map function iterates over key/value pairs. We need to match both of them, and then output the key and the changed value to create the new Map. The mapValues function makes this quite a bit easier.

scala> foo.mapValues(2+)
res6: scala.collection.immutable.Map[Int,Int] = Map(1 -> 4, 3 -> 6)

Returning to the question about computing the length using mapValues or map: it is just a matter of which values you are transforming, as in the following examples.

scala> val sentence = "here is a sentence with some words".split(" ").toList
sentence: List[java.lang.String] = List(here, is, a, sentence, with, some, words)

scala> sentence.map(_.length)
res7: List[Int] = List(4, 2, 1, 8, 4, 4, 5)

scala> val firstCharTokens = sentence.groupBy(x=>x(0))
firstCharTokens: scala.collection.immutable.Map[Char,List[java.lang.String]] = Map(s -> List(sentence, some), a -> List(a), i -> List(is), h -> List(here), w -> List(with, words))

scala> firstCharTokens.mapValues(_.length)
res9: scala.collection.immutable.Map[Char,Int] = Map(s -> 2, a -> 1, i -> 1, h -> 1, w -> 2)

Q. Is there any function that splits a list into two lists with the elements in the alternating positions of the original list? For example,

MainList =(1,2,3,4,5,6)

List1 = (1,3,5)
List2 = (2,4,6)

A. Given the exact main list you provided, one can use the partition function and use the modulo operation to see whether the value is divisible evenly by 2 or not.

scala> val mainList = List(1,2,3,4,5,6)
mainList: List[Int] = List(1, 2, 3, 4, 5, 6)

scala> mainList.partition(_ % 2 == 0)
res0: (List[Int], List[Int]) = (List(2, 4, 6),List(1, 3, 5))

So, partition returns a pair of Lists. The first has all the elements that match the condition and the second has all the ones that do not.

Of course, this wouldn’t work in general for Lists that have Strings, or that don’t have Ints in order, etc. However, the indices of a List are always well-behaved in this way, so we just need to do a bit more work by zipping each element with its index and then partitioning based on indices.

scala> val unordered = List("b","2","a","4","z","8")
unordered: List[java.lang.String] = List(b, 2, a, 4, z, 8)

scala> unordered.zipWithIndex
res1: List[(java.lang.String, Int)] = List((b,0), (2,1), (a,2), (4,3), (z,4), (8,5))

scala> val (evens, odds) = unordered.zipWithIndex.partition(_._2 % 2 == 0)
evens: List[(java.lang.String, Int)] = List((b,0), (a,2), (z,4))
odds: List[(java.lang.String, Int)] = List((2,1), (4,3), (8,5))

scala> evens.map(_._1)
res2: List[java.lang.String] = List(b, a, z)

scala> odds.map(_._1)
res3: List[java.lang.String] = List(2, 4, 8)

Based on this, you could of course write a function that does this for any arbitrary list.
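One possible version of such a function is sketched below. The name alternate is my own choice for illustration; it simply packages up the zipWithIndex and partition steps shown above.

```scala
// Splits any list into (elements at even indices, elements at odd indices).
def alternate[T](xs: List[T]): (List[T], List[T]) = {
  val (evens, odds) = xs.zipWithIndex.partition { case (_, index) => index % 2 == 0 }
  // Drop the indices again, keeping only the original elements.
  (evens.map(_._1), odds.map(_._1))
}

val (list1, list2) = alternate(List(1, 2, 3, 4, 5, 6))
val (letters, digits) = alternate(List("b", "2", "a", "4", "z", "8"))
```

Because the function is parameterized by T, it works the same way for Ints, Strings, or anything else.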

Q. How to convert a List to a Vector and vice-versa?

A. Use toIndexedSeq and toList.

scala> val foo = List(1,2,3,4)
foo: List[Int] = List(1, 2, 3, 4)

scala> val bar = foo.toIndexedSeq
bar: scala.collection.immutable.IndexedSeq[Int] = Vector(1, 2, 3, 4)

scala> val baz = bar.toList
baz: List[Int] = List(1, 2, 3, 4)

scala> foo == baz
res0: Boolean = true
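In more recent Scala versions (2.10 onward, as far as I know) there is also a toVector method, which gives you the Vector type directly rather than the more general IndexedSeq. A minimal sketch, with variable names chosen just for this example:

```scala
val numbers = List(1, 2, 3, 4)

// toVector gives a value whose static type is Vector[Int].
val asVector: Vector[Int] = numbers.toVector

// toList goes back the other way.
val backToList: List[Int] = asVector.toList
```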

Q. The advantage of a vector over a list is the constant time look-up. What is the advantage of using a list over a vector?

A. A List is slightly faster for operations at the head (front) of the sequence, so if all you are doing is a traversal (accessing each element in order, e.g. when mapping), then Lists are perfectly adequate and may be more efficient. They also have some nice pattern matching behavior for case statements.

However, common wisdom seems to be that you should default to using Vectors. See Daniel Spiewak’s nice answer on Stackoverflow:

http://stackoverflow.com/questions/6928327/when-should-i-choose-vector-in-scala
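Here is a small sketch of the List-friendly operations mentioned above (prepending and pattern matching on head and tail), next to indexed access on a Vector. The function name sumList and the values are made up for illustration.

```scala
// Prepending to a List is a constant-time operation.
val rest = List(2, 3, 4)
val all = 1 :: rest

// Lists can be taken apart with the same :: operator in case statements.
def sumList(xs: List[Int]): Int = xs match {
  case Nil => 0
  case head :: tail => head + sumList(tail)
}

// A Vector is the better fit when you need to jump to an arbitrary index.
val vec = Vector(10, 20, 30, 40)
val third = vec(2)
```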

Q. With splitting strings, holmes.split("\\s"): \n and \t just require a single ‘\’ to recognize their special functionality, but why are two ‘\’s required for the white space character?

A. That’s because \n and \t actually mean something in a String.

scala> println("Here is a line with a tab\tor\ttwo, followed by\na new line.")
Here is a line with a tab    or    two, followed by
a new line.

scala> println("This will break\s.")
<console>:1: error: invalid escape character
println("This will break\s.")

So, you are supplying a String argument to split, and it uses that to construct a regular expression. Given that \s is not a string character, but is a regex metacharacter, you need to escape it. You can of course use split("""\s"""), though that isn’t exactly better in this case.
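To see that the two spellings really are the same thing, here is a short sketch; the example sentence is made up.

```scala
val text = "Here is\ta line\nwith mixed whitespace"

// "\\s" is a two-character String (backslash, s) that the regex engine reads as \s.
val escaped = text.split("\\s").toList

// Triple-quoted strings do not process backslash escapes, so one backslash suffices.
val raw = text.split("""\s""").toList
```

Both calls hand exactly the same two-character pattern to the regex engine, so the results are identical.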

Q. I have long been programming in C++ and Java. Therefore, I put semicolon at the end of the line unconsciously. It seems that the standard coding style of Scala doesn’t recommend to use semicolons. However, I saw that there are some cases that require semicolons as you showed last class. Is there any specific reason why semicolon loses its role in Scala?

A. The main reason is to improve readability since the semicolon is rarely needed when writing standard code in editors (as opposed to one liners in the REPL). However, when you want to do something in a single line, like handling multiple cases, you need the semicolons.

scala> val foo = List("a",1,"b",2)
foo: List[Any] = List(a, 1, b, 2)

scala> foo.map { case(x: String) => x; case(x: Int) => x.toString }
res5: List[String] = List(a, 1, b, 2)

But, in general, it’s best to just split these cases over multiple lines in any actual code.
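For comparison, here is a sketch of the same kind of match written over several lines. The final fallback case is my own addition so that the match covers every possible element.

```scala
val mixed: List[Any] = List("a", 1, "b", 2)

// One case per line, so no semicolons are needed.
val strings = mixed.map {
  case x: String => x
  case x: Int => x.toString
  case other => other.toString // fallback so that no element is left unmatched
}
```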

Q. Is there no way to use _ in map like methods for collections that consist of pairs? For example, List((1,1),(2,2)).map(e => e._1 + e._2) works, but List((1,1),(2,2)).map(_._1 + _._2) does not work.

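A. Each _ in an expression stands for a different parameter, so _._1 + _._2 is read as a function of two arguments, (a, b) => a._1 + b._2, rather than as a function that uses the same pair twice. In other words, you only get to use _ once per argument. It is better anyway to use a case statement that makes it clear what the members of the pairs are.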

scala>  List((1,1),(2,2)).map { case(num1, num2) => num1+num2 }
res6: List[Int] = List(2, 4)
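The following sketch shows a place where two underscores are exactly what you want, alongside the single-underscore and named forms. The values are only for illustration.

```scala
val pairs = List((1, 1), (2, 2))

// Each underscore is a separate parameter, so _ + _ is a two-argument
// function, which is what reduce expects.
val total = List(1, 2, 3, 4).reduce(_ + _)

// A single underscore is fine when the argument is used only once.
val firsts = pairs.map(_._1)

// To use both members of a pair, name them.
val sums = pairs.map { case (num1, num2) => num1 + num2 }
```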

Q. I am unsure about the exact meaning of and the difference between “=>” and “->”. They both seem to mean something like “apply X to Y” and I see that each is used in a particular context, but what is the logic behind that?

A. The use of -> simply constructs a Tuple2, as is pretty clear in the following snippet.

scala> val foo = (1,2)
foo: (Int, Int) = (1,2)

scala> val bar = 1->2
bar: (Int, Int) = (1,2)

scala> foo == bar
res11: Boolean = true

Primarily, it is syntactic sugar that provides an intuitive symbol for creating elements of a Map. Compare the following two ways of declaring the same Map.

scala> Map(("a",1),("b",2))
res9: scala.collection.immutable.Map[java.lang.String,Int] = Map(a -> 1, b -> 2)

scala> Map("a"->1,"b"->2)
res10: scala.collection.immutable.Map[java.lang.String,Int] = Map(a -> 1, b -> 2)

The second seems more readable to me.

The use of => indicates that you are defining a function. The basic form is ARGUMENTS => RESULT.

scala> val addOne = (x: Int) => x+1
addOne: Int => Int = <function1>

scala> addOne(2)
res7: Int = 3

scala> val addTwoNumbers = (num1: Int, num2: Int) => num1+num2
addTwoNumbers: (Int, Int) => Int = <function2>

scala> addTwoNumbers(3,5)
res8: Int = 8

Normally, you use it in defining anonymous functions as arguments to functions like map, filter, and such.
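As a quick illustration of both arrows in everyday use, here is a sketch with made-up names and values.

```scala
val numbers = List(1, 2, 3, 4, 5, 6)

// Anonymous functions passed directly to filter and map.
val evenSquares = numbers.filter(x => x % 2 == 0).map(x => x * x)

// The same kind of function can be given a name and reused.
val isEven = (x: Int) => x % 2 == 0
val odds = numbers.filterNot(isEven)

// -> builds pairs, so both arrows can appear in one expression.
val squaresByNumber = numbers.map(x => x -> x * x).toMap
```

In the last line, => separates the parameter x from the function body, while -> builds the (x, x * x) pair that toMap turns into a key and value.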

Q. Is there a more convenient way of expressing vowels as [AEIOUaeiou] and consonants as [BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz] in RegExes?

A. You can use Strings when defining regexes, so you can have a variable for vowels and one for consonants.

scala> val vowel = "[AEIOUaeiou]"
vowel: java.lang.String = [AEIOUaeiou]

scala> val consonant = "[BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz]"
consonant: java.lang.String = [BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz]

scala> val MyRE = ("("+vowel+")("+consonant+")("+vowel+")").r
MyRE: scala.util.matching.Regex = ([AEIOUaeiou])([BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz])([AEIOUaeiou])

scala> val MyRE(x,y,z) = "aJE"
x: String = a
y: String = J
z: String = E
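If typing out every consonant feels tedious, Java’s regex syntax (which Scala uses underneath) supports character-class intersection with &&, so a consonant can be described as “a letter that is not a vowel”. The sketch below assumes ASCII letters only, and the name VowelConsonantVowel is my own.

```scala
val vowel = "[AEIOUaeiou]"

// Any ASCII letter that is not a vowel, via character-class intersection.
val consonant = "[a-zA-Z&&[^AEIOUaeiou]]"

val VowelConsonantVowel = ("(" + vowel + ")(" + consonant + ")(" + vowel + ")").r

// Matching in a case statement requires the whole string to fit the pattern.
val parts = "aJE" match {
  case VowelConsonantVowel(first, middle, last) => Some((first, middle, last))
  case _ => None
}

// A string of three vowels has no consonant in the middle, so nothing is found.
val rejectsAllVowels = VowelConsonantVowel.findFirstIn("aie").isEmpty
```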

Q. The “\b” in RegExes marks a boundary, right? So, it also captures the “-”. But if I have a single string “sdnfeorgn”, it does NOT capture the boundaries of that, is that correct? And if so, why doesn’t it?

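A. The \b metacharacter is a zero-width assertion: it matches a position between a word character and a non-word character (or the start or end of the string), and it never captures any characters itself, so it does not capture the “-” either. The string “sdnfeorgn” is a single run of word characters, so there are no boundaries inside it; the only boundaries are at its very start and end.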
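A quick way to check where \b matches is to list the match positions, keeping in mind that \b matches positions rather than characters and that the start and end of a run of word characters count as boundaries too. This is only an illustrative sketch; the name Boundary and the second test string are my own choices.

```scala
val Boundary = """\b""".r

// \b matches positions, not characters, so we look at where each match starts.
val singleWord = Boundary.findAllMatchIn("sdnfeorgn").map(_.start).toList

// With a hyphen in the middle, there are boundaries on both sides of it.
val hyphenated = Boundary.findAllMatchIn("in-the").map(_.start).toList
```

For the single word the only positions found are the very start and the very end; nothing inside the word counts as a boundary.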

Intermediate questions

Q. The flatMap function takes lists of lists and merges them to single list. But in the example:

scala> (1 to 10).toList.map(x=>squareOddNumber(x))
res16: List[Option[Int]] = List(Some(1), None, Some(9), None, Some(25), None, Some(49), None, Some(81), None)

scala> (1 to 10).toList.flatMap(x=>squareOddNumber(x))
res17: List[Int] = List(1, 9, 25, 49, 81)

Here it is not list of list but just a list. In this case it expects the list to be Option list.
I tried running the code with function returning just number or None. It showed error. So is there any way to use flatmap without Option lists and just list. For example, List(1, None, 9, None, 25) should be returned as List(1, 9, 25).

A. No, this won’t work because List(1, None, 9, None, 25) mixes Options with Ints.

scala> val mixedup = List(1, None, 9, None, 25)
mixedup: List[Any] = List(1, None, 9, None, 25)

So, you should have your function return an Option which means returning Somes or Nones. Then flatMap will work happily.

One way to think of Options is that they are like Lists with zero or one element, as can be noted by the parallels in the following snippet.

scala> val foo = List(List(1),Nil,List(3),List(6),Nil)
foo: List[List[Int]] = List(List(1), List(), List(3), List(6), List())

scala> foo.flatten
res12: List[Int] = List(1, 3, 6)

scala> val bar = List(Option(1),None,Option(3),Option(6),None)
bar: List[Option[Int]] = List(Some(1), None, Some(3), Some(6), None)

scala> bar.flatten
res13: List[Int] = List(1, 3, 6)
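Putting this together, here is a sketch of an Option-returning function used with flatMap. The body of squareOddNumber is my guess at what the function in the question looks like, since it isn’t shown. I’ve also included collect, which is one way to pull the Ints out of a mixed-up List if you are stuck with one.

```scala
// A guess at the function from the question: Some(square) for odd numbers,
// None for even ones.
def squareOddNumber(x: Int): Option[Int] =
  if (x % 2 == 1) Some(x * x) else None

// flatMap drops the Nones and unwraps the Somes.
val squares = (1 to 10).toList.flatMap(x => squareOddNumber(x))

// If you already have a mixed List[Any], collect keeps only the Ints.
val mixedup: List[Any] = List(1, None, 9, None, 25)
val onlyInts = mixedup.collect { case n: Int => n }
```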

Q. Does scala have generic templates (like C++, Java)? eg. in C++, we can use vector<int>, vector<string> etc. Is that possible in scala? If so, how?

A. Yes, every collection type is parameterized. Notice that each of the following variables is parameterized by the type of the elements they are initialized with.

scala> val foo = List(1,2,3)
foo: List[Int] = List(1, 2, 3)

scala> val bar = List("a","b","c")
bar: List[java.lang.String] = List(a, b, c)

scala> val baz = List(true, false, true)
baz: List[Boolean] = List(true, false, true)

You can create your own parameterized classes straightforwardly.

scala> class Flexible[T] (val data: T)
defined class Flexible

scala> val foo = new Flexible(1)
foo: Flexible[Int] = Flexible@7cd0570e

scala> val bar = new Flexible("a")
bar: Flexible[java.lang.String] = Flexible@31b6956f

scala> val baz = new Flexible(true)
baz: Flexible[Boolean] = Flexible@5b58539f

scala> foo.data
res0: Int = 1

scala> bar.data
res1: java.lang.String = a

scala> baz.data
res2: Boolean = true

Q. How can we easily create, initialize and work with multi-dimensional arrays (and dictionaries)?

A. Use the fill function of the Array object to create them.

scala> Array.fill(2)(1.0)
res8: Array[Double] = Array(1.0, 1.0)

scala> Array.fill(2,3)(1.0)
res9: Array[Array[Double]] = Array(Array(1.0, 1.0, 1.0), Array(1.0, 1.0, 1.0))

scala> Array.fill(2,3,2)(1.0)
res10: Array[Array[Array[Double]]] = Array(Array(Array(1.0, 1.0), Array(1.0, 1.0), Array(1.0, 1.0)), Array(Array(1.0, 1.0), Array(1.0, 1.0), Array(1.0, 1.0)))

Once you have these in hand, you can iterate over them as usual.

scala> val my2d = Array.fill(2,3)(1.0)
my2d: Array[Array[Double]] = Array(Array(1.0, 1.0, 1.0), Array(1.0, 1.0, 1.0))

scala> my2d.map(row => row.map(x=>x+1))
res11: Array[Array[Double]] = Array(Array(2.0, 2.0, 2.0), Array(2.0, 2.0, 2.0))
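Two related functions on the Array object are also worth knowing: tabulate, which computes each cell from its indices, and ofDim, which creates an array of default values that you can then update. A small sketch, with names and values chosen for illustration:

```scala
// Array.tabulate fills each cell using its row and column indices.
val table = Array.tabulate(2, 3)((row, col) => row * 3 + col)

// Array.ofDim creates an array of default values (0.0 for Doubles).
val grid = Array.ofDim[Double](2, 3)

// Arrays are mutable, so a cell can be updated in place.
grid(1)(2) = 5.0
```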

For dictionaries (Maps), you can use mutable HashMaps to create an empty Map and then add elements to it. For that, see this blog post:

http://bcomposes.wordpress.com/2011/09/19/first-steps-in-scala-for-beginning-programmers-part-8/
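As a brief sketch of the mutable HashMap approach just described, here is a word counter. The word list is made up, and converting to an immutable Map at the end is just one reasonable design choice.

```scala
import scala.collection.mutable

// Start with an empty mutable map from words to counts.
val counts = mutable.HashMap[String, Int]()

for (word <- List("a", "rose", "is", "a", "rose")) {
  // getOrElse supplies 0 the first time a word is seen.
  counts(word) = counts.getOrElse(word, 0) + 1
}

// Convert to an immutable Map once the counting is done.
val frozen: Map[String, Int] = counts.toMap
```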

Q. Is the apply function similar to constructor in C++, Java? Where will the apply function be practically used? Is it for initialising values of attributes?

A. No, the apply function is like any other function except that it allows you to call it without writing out “apply”. Consider the following class.

class AddX (x: Int) {
  def apply(y: Int) = x+y
  override def toString = "My number is " + x
}

Here’s how we can use it.

scala> val add1 = new AddX(1)
add1: AddX = My number is 1

scala> add1(4)
res0: Int = 5

scala> add1.apply(4)
res1: Int = 5

scala> add1.toString
res2: java.lang.String = My number is 1

So, the apply method is just (very handy) syntactic sugar that allows you to specify one function as fundamental to a class you have designed (actually, you can have multiple apply methods as long as each one has a unique parameter list). For example, with Lists, the apply method returns the value at the index provided, and for Maps it returns the value associated with the given key.

scala> val foo = List(1,2,3)
foo: List[Int] = List(1, 2, 3)

scala> foo(2)
res3: Int = 3

scala> foo.apply(2)
res4: Int = 3

scala> val bar = Map(1->2,3->4)
bar: scala.collection.immutable.Map[Int,Int] = Map(1 -> 2, 3 -> 4)

scala> bar(1)
res5: Int = 2

scala> bar.apply(1)
res6: Int = 2
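That said, there is one common use of apply that does look constructor-like: an apply method on a class’s companion object is often used as a factory, which is why List(1,2,3) works without “new”. The sketch below uses a hypothetical Multiplier class to show both kinds of apply side by side.

```scala
class Multiplier(val factor: Int) {
  // Instance-level apply: lets a Multiplier be called like a function.
  def apply(y: Int): Int = factor * y
}

object Multiplier {
  // Companion-level apply: a factory, so callers can skip `new`.
  def apply(factor: Int): Multiplier = new Multiplier(factor)
}

val triple = Multiplier(3) // calls Multiplier.apply(3) on the companion object
val result = triple(4)     // calls triple.apply(4) on the instance
```

Even here, apply is an ordinary method; the actual construction still happens in the call to new inside it.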

Q. In the SBT tutorial you discuss “Node” and “Value” as being case classes. What is the alternative to a case class?

A. A normal class. Case classes are the special case. They do two things (and more) for you. The first is that you don’t have to use “new” to create a new object. Consider the following otherwise identical classes.

scala> class NotACaseClass (val data: Int)
defined class NotACaseClass

scala> case class IsACaseClass (val data: Int)
defined class IsACaseClass

scala> val foo = new NotACaseClass(4)
foo: NotACaseClass = NotACaseClass@a5c0f8f

scala> val bar = IsACaseClass(4)
bar: IsACaseClass = IsACaseClass(4)

That may seem like a little thing, but it can significantly improve code readability. Consider creating Lists within Lists within Lists if you had to use “new” all the time, for example. This is definitely true for Node and Value, which are used to build trees.

Case classes also support matching, as in the following.

scala> val IsACaseClass(x) = bar
x: Int = 4

A normal class cannot do this.

scala> val NotACaseClass(x) = foo
<console>:13: error: not found: value NotACaseClass
val NotACaseClass(x) = foo
^
<console>:13: error: recursive value x needs type
val NotACaseClass(x) = foo
^

If you mix the case class into a List and map over it, you can match it like you can with other classes, like Lists and Ints. Consider the following heterogeneous List.

scala> val stuff = List(IsACaseClass(3), List(2,3), IsACaseClass(5), 4)
stuff: List[Any] = List(IsACaseClass(3), List(2, 3), IsACaseClass(5), 4)

We can convert this to a List of Ints by processing each element according to its type by matching.

scala> stuff.map { case List(x,y) => x; case IsACaseClass(x) => x; case x: Int => x }
<console>:13: warning: match is not exhaustive!
missing combination              *           Nil             *             *

stuff.map { case List(x,y) => x; case IsACaseClass(x) => x; case x: Int => x }
^

warning: there were 1 unchecked warnings; re-run with -unchecked for details
res10: List[Any] = List(3, 2, 5, 4)

If you don’t want to see the warning in the REPL, add a case for things that don’t match that throws a MatchError.

scala> stuff.map { case List(x,y) => x; case IsACaseClass(x) => x; case x: Int => x; case _ => throw new MatchError }
warning: there were 1 unchecked warnings; re-run with -unchecked for details
res13: List[Any] = List(3, 2, 5, 4)

Better yet, return Options (using None for the unmatched case) and use flatMap instead.

scala> stuff.flatMap { case List(x,y) => Some(x); case IsACaseClass(x) => Some(x); case x: Int => Some(x); case _ => None }
warning: there were 1 unchecked warnings; re-run with -unchecked for details
res14: List[Any] = List(3, 2, 5, 4)
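To tie this back to trees, here is a sketch of how case classes like Node and Value might be defined and matched. These definitions are hypothetical; the ones in the SBT tutorial may well differ.

```scala
// Hypothetical definitions; the Node and Value classes in the tutorial may differ.
sealed trait Tree
case class Node(children: List[Tree]) extends Tree
case class Value(number: Int) extends Tree

// No `new` is needed, so nested structures stay readable.
val tree = Node(List(Value(1), Node(List(Value(2), Value(3))), Value(4)))

// Matching on the case classes pulls out their contents directly.
def total(t: Tree): Int = t match {
  case Value(n) => n
  case Node(children) => children.map(total).sum
}
```

Marking the trait as sealed lets the compiler warn you when a match leaves out one of the cases, which avoids the kind of warning juggling shown above.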

Q. In C++ the default access specifier is private; in Java one needs to specify private or public for each class member, whereas in Scala the default access specifier for a class is public. What could be the design motivation behind this when one of the purposes of the class is data hiding?

A. The reason is that Scala has a much more refined access specification scheme than Java that makes public the rational choice. See the discussion here:

http://stackoverflow.com/questions/4656698/default-public-access-in-scala

Another key aspect of this is that the general emphasis in Scala is on using immutable data structures, so there isn’t any danger of someone changing the internal state of your objects if you have designed them in this way. This in turn gets rid of the ridiculous getter and setter methods that breed and multiply in Java programs. See “Why getters and setters are evil” for more discussion:

http://www.javaworld.com/javaworld/jw-09-2003/jw-0905-toolbox.html

After you get used to programming in Scala, the whole getter/setter thing that is so common in Java code is pretty much gag worthy.

In general, it is still a good idea to use private[this] as a modifier to methods and variables whenever they are only needed by an object itself.

Q. How do we define overloaded constructors in Scala?

Q. The way a class is defined in Scala introduced in the tutorial, seems to have only one constructor. Is there any way to provide multiple constructors like Java?

A. You can add additional (auxiliary) constructors by defining methods named this.

class SimpleTriple (x: Int, y: Int, z: String) {
  def this (x: Int, z: String) = this(x,0,z)
  def this (x: Int, y: Int) = this(x,y,"a")
  override def toString = x + ":" + y + ":" + z
}

scala> val foo = new SimpleTriple(1,2,"hello")
foo: SimpleTriple = 1:2:hello

scala> val bar = new SimpleTriple(1,"goodbye")
bar: SimpleTriple = 1:0:goodbye

scala> val baz = new SimpleTriple(1,3)
baz: SimpleTriple = 1:3:a

Notice that you must supply an initial value for every one of the parameters of the class. This contrasts with Java, which allows you to leave some fields uninitialized (and which tends to lead to nasty bugs and bad design).

Note that you can also provide defaults to parameters.

class SimpleTripleWithDefaults (x: Int, y: Int = 0, z: String = "a") {
  override def toString = x + ":" + y + ":" + z
}

scala> val foo = new SimpleTripleWithDefaults(1)
foo: SimpleTripleWithDefaults = 1:0:a

scala> val bar = new SimpleTripleWithDefaults(1,2)
bar: SimpleTripleWithDefaults = 1:2:a

However, you can’t omit a middle parameter while specifying the last one.

scala> val foo = new SimpleTripleWithDefaults(1,"xyz")
<console>:12: error: type mismatch;
found   : java.lang.String("xyz")
required: Int
Error occurred in an application involving default arguments.
val foo = new SimpleTripleWithDefaults(1,"xyz")
^

But, you can name the parameters in the initialization if you want to be able to do this.

scala> val foo = new SimpleTripleWithDefaults(1,z="xyz")
foo: SimpleTripleWithDefaults = 1:0:xyz

You then have complete freedom to change the parameters around.

scala> val foo = new SimpleTripleWithDefaults(z="xyz",x=42,y=3)
foo: SimpleTripleWithDefaults = 42:3:xyz

Q. I’m still not clear on the difference between classes and traits.  I guess I see a conceptual difference but I don’t really understand what the functional difference is — how is creating a “trait” different from creating a class with maybe fewer methods associated with it?

A. Yes, they are different. First off, traits are abstract, which means you cannot create instances of them directly. Consider the following contrast.

scala> class FooClass
defined class FooClass

scala> trait FooTrait
defined trait FooTrait

scala> val fclass = new FooClass
fclass: FooClass = FooClass@1b499616

scala> val ftrait = new FooTrait
<console>:8: error: trait FooTrait is abstract; cannot be instantiated
val ftrait = new FooTrait
^

You can extend a trait to make a concrete class, however.

scala> class FooTraitExtender extends FooTrait
defined class FooTraitExtender

scala> val ftraitExtender = new FooTraitExtender
ftraitExtender: FooTraitExtender = FooTraitExtender@53d26552

This gets more interesting if the trait has some methods, of course. Here’s a trait, Animal, that declares two abstract methods, makeNoise and doBehavior.

trait Animal {
  def makeNoise: String
  def doBehavior (other: Animal): String
}

We can extend this trait with new class definitions; each extending class must implement both of these methods (or else be declared abstract).

case class Bear (name: String, defaultBehavior: String = "Regard warily...") extends Animal {
  def makeNoise = "ROAR!"
  def doBehavior (other: Animal) = other match {
    case b: Bear => makeNoise + " I'm " + name + "."
    case m: Mouse => "Eat it!"
    case _ => defaultBehavior
  }
  override def toString = name
}

case class Mouse (name: String) extends Animal {
  def makeNoise = "Squeak?"
  def doBehavior (other: Animal) = other match {
    case b: Bear => "Run!!!"
    case m: Mouse => makeNoise + " I'm " + name + "."
    case _ => "Hide!"
  }
  override def toString = name
}

Notice that Bear and Mouse have different parameter lists, but both can be Animals because they fully implement the Animal trait. We can now start creating objects of the Bear and Mouse classes and have them interact. We don’t need to use “new” because they are case classes (and this also allowed them to be used in the match statements of the doBehavior methods).

val yogi = Bear("Yogi", "Hello!")
val baloo = Bear("Baloo", "Yawn...")
val grizzly = Bear("Grizzly")
val stuart = Mouse("Stuart")

println(yogi + ": " + yogi.makeNoise)
println(stuart + ": " + stuart.makeNoise)
println("Grizzly to Stuart: " + grizzly.doBehavior(stuart))

We can also create a singleton object that is of the Animal type by using the following declaration.

object John extends Animal {
  def makeNoise = "Hullo!"
  def doBehavior (other: Animal) = other match {
    case b: Bear => "Nice bear... nice bear..."
    case _ => makeNoise
  }
  override def toString = "John"
}

Here, John is an object, not a class. Because this object implements the Animal trait, it successfully extends it and can act as an Animal. This means that a Bear like baloo can interact with John.

println("Baloo to John: " + baloo.doBehavior(John))

The output of the above code when run as a script is the following.

Yogi: ROAR!
Stuart: Squeak?
Grizzly to Stuart: Eat it!
Baloo to John: Yawn...

The closer distinction is between traits and abstract classes. In fact, everything shown above could have been done with Animal as an abstract class rather than as a trait. One difference is that an abstract class can take constructor parameters while traits cannot (at least in Scala 2). Another key difference between them is that traits can be used to support limited multiple inheritance, as shown in the next question/answer.
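Here is a sketch of what the abstract class alternative might look like. The names NamedAnimal and Dog are made up for this example, and the constructor parameter is there to show the one thing an abstract class can do here that the Animal trait above does not.

```scala
// An abstract class can declare a constructor parameter shared by all subclasses.
abstract class NamedAnimal(val name: String) {
  def makeNoise: String // still abstract, as in the trait version
  def introduce: String = makeNoise + " I'm " + name + "."
}

class Dog(name: String) extends NamedAnimal(name) {
  def makeNoise = "Woof!"
}

val rex = new Dog("Rex")
```

Like a trait, NamedAnimal cannot be instantiated on its own; only concrete subclasses such as Dog can.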

Q. Does Scala support multiple inheritance?

A. Yes, via traits with implementations of some methods. Here’s an example, with a trait Clickable that has an abstract (unimplemented) method getMessage, an implemented method click, and a private, reassignable variable numTimesClicked (the latter two show clearly that traits are different from Java interfaces).

trait Clickable {
  private var numTimesClicked = 0
  def getMessage: String
  def click = {
    val output = numTimesClicked + ": " + getMessage
    numTimesClicked += 1
    output
  }
}

Now let’s say we have a MessageBearer class (that we may have wanted for entirely different reasons having nothing to do with clicking).

class MessageBearer (val message: String) {
  override def toString = message
}

A new class can be now created by extending MessageBearer and “mixing in” the Clickable trait.

class ClickableMessageBearer(message: String) extends MessageBearer(message) with Clickable {
  def getMessage = message
}

ClickableMessageBearer now has the abilities of both MessageBearers (which is to be able to retrieve its message) and Clickables.

scala> val cmb1 = new ClickableMessageBearer("I'm number one!")
cmb1: ClickableMessageBearer = I'm number one!

scala> val cmb2 = new ClickableMessageBearer("I'm number two!")
cmb2: ClickableMessageBearer = I'm number two!

scala> cmb1.click
res3: java.lang.String = 0: I'm number one!

scala> cmb1.message
res4: String = I'm number one!

scala> cmb1.click
res5: java.lang.String = 1: I'm number one!

scala> cmb2.click
res6: java.lang.String = 0: I'm number two!

scala> cmb1.click
res7: java.lang.String = 2: I'm number one!

scala> cmb2.click
res8: java.lang.String = 1: I'm number two!
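The example above mixes one trait into a class that already has a superclass. You can also mix in several traits at once, which is where the “multiple” really shows. The sketch below uses made-up traits Greeter and Counter for illustration.

```scala
trait Greeter {
  def name: String // abstract: supplied by whoever mixes this in
  def greet: String = "Hello, I'm " + name + "."
}

trait Counter {
  private var count = 0
  def increment(): Int = { count += 1; count }
}

// At most one superclass, but any number of traits can be mixed in using `with`.
class Robot(val name: String) extends Greeter with Counter

val robby = new Robot("Robby")
```

Robot gets implemented methods from both traits without writing any of its own.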

Q. Why are there toString, toInt, and toList functions, but there isn’t a toTuple function?

A. This is a basic question that leads directly to the more advanced topic of implicits. There are a number of reasons behind this. To start with, it is important to realize that there are many types of Tuples, starting with a Tuple with a single element (a Tuple1) up to 22 elements (a Tuple22). Note that when you use (,) to create a tuple, it is implicitly invoking a constructor for the corresponding TupleN of the correct arity.

scala> val b = (1,2,3)
b: (Int, Int, Int) = (1,2,3)

scala> val c = Tuple3(1,2,3)
c: (Int, Int, Int) = (1,2,3)

scala> b==c
res4: Boolean = true

Given this, it is obviously not meaningful to have a function toTuple on Seqs (sequences) that are longer than 22. This means there is no generic way to have, say a List or Array, and then call toTuple on it and expect reliable behavior to happen.

However, if you want this functionality (even though limited by the above constraint of 22 elements max), Scala allows you to “add” methods to existing classes by using implicit definitions. You can find lots of discussions about implicits by searching for “scala implicits”. But, here’s an example that shows how it works for this particular case.

val foo = List(1,2)
val bar = List(3,4,5)
val baz = List(6,7,8,9)

foo.toTuple

class TupleAble[X] (elements: Seq[X]) {
  def toTuple = elements match {
    case Seq(a) => Tuple1(a)
    case Seq(a,b) => (a,b)
    case Seq(a,b,c) => (a,b,c)
    case _ => throw new RuntimeException("Sequence too long to be handled by toTuple: " + elements)
  }
}

foo.toTuple

implicit def seqToTuple[X](x: Seq[X]) = new TupleAble(x)

foo.toTuple
bar.toTuple
baz.toTuple

If you put this into the Scala REPL, you’ll see that the first invocation of foo.toTuple gets an error:

scala> foo.toTuple
<console>:9: error: value toTuple is not a member of List[Int]
foo.toTuple
^

Note that class TupleAble takes a Seq in its constructor and then provides the method toTuple, using that Seq. It is able to do so for Seqs with 1, 2 or 3 elements, and above that it throws an exception. (We could of course keep listing more cases out and go up to 22 element tuples, but this shows the point.)

The second invocation of foo.toTuple still doesn’t work, because foo is a List (a kind of Seq) and there isn’t a toTuple method for Lists. That’s where the implicit function seqToTuple comes in: once it is declared, Scala notes that you are trying to call toTuple on a Seq, notes that there is no such function for Seqs, but sees that there is an implicit conversion from Seqs to TupleAbles via seqToTuple, and then it sees that TupleAble has a toTuple method. Based on that, it compiles and then produces the desired behavior. This is a very handy ability of Scala that can really simplify your code if you use it well and with care.
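In Scala 2.10 and later, the wrapper class and the implicit conversion can be combined into a single implicit class. The sketch below is one way to write that; the enclosing object TupleSyntax is my own naming, and it is there because an implicit class has to live inside an object, class or trait.

```scala
object TupleSyntax {
  // An implicit class combines the wrapper class and the implicit conversion.
  implicit class TupleAble[X](elements: Seq[X]) {
    def toTuple: Product = elements match {
      case Seq(a) => Tuple1(a)
      case Seq(a, b) => (a, b)
      case Seq(a, b, c) => (a, b, c)
      case _ => throw new RuntimeException("Cannot convert to a tuple: " + elements)
    }
  }
}

// Bring the conversion into scope, just as seqToTuple had to be declared first.
import TupleSyntax._

val pair = List(1, 2).toTuple
val triple = List(3, 4, 5).toTuple
```

Note that the declared return type is just Product, since the different cases produce tuples of different types. That loss of precise type information is part of why a general toTuple is awkward to provide.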

Copyright 2012 Jason Baldridge

The text of this post is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original post.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

Topics: iteration, mapping, for expressions, foreach loops, Lists, ListBuffers, Arrays, indexed sequences, recursion

Introduction

A common question from students who are new to Scala is: What is the difference between using the map function on lists, using for expressions and foreach loops? One of the major sources of confusion with regard to this question is that a for expression in Scala is not the equivalent of for loops in languages like Python and Java; instead, the equivalent of for loops is foreach in Scala. This distinction highlights the importance of understanding what it means to return values versus relying on side-effects to perform certain computations. It also helps reinforce some points about fixed versus reassignable variables and immutable versus mutable data structures.
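As a first taste of the distinction, here is a minimal sketch with a made-up word list: the for expression and map both return a new List, while foreach returns nothing useful and has to fill a mutable buffer as a side effect.

```scala
import scala.collection.mutable.ListBuffer

val someWords = List("This", "is", "a", "list")

// A for expression with yield returns a value, just like map.
val lengthsFromFor = for (word <- someWords) yield word.length
val lengthsFromMap = someWords.map(_.length)

// foreach returns Unit, so any result must be built up as a side effect.
val lengthsBuffer = ListBuffer[Int]()
someWords.foreach(word => lengthsBuffer += word.length)
```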

The task and its functional solution

To demonstrate this, let’s consider a simple task. Given a List of words, compute two lists: one has the lengths of each word and the second indicates whether a word starts with a capital letter or not. For example, start with the following list.

scala> val words = List("This", "is", "a", "list", "of", "English", "words", ".")
words: List[java.lang.String] = List(This, is, a, list, of, English, words, .)

We can compute the two lists by mapping over the words list as follows.

scala> words.map(_.length)
res0: List[Int] = List(4, 2, 1, 4, 2, 7, 5, 1)

scala> words.map(_(0).isUpper)
res1: List[Boolean] = List(true, false, false, false, false, true, false, false)

So, that’s it. However, let’s do this without using different calls to the map function (or multiple foreach loops, as we’ll see below). The easiest way to do this is to map each word to a tuple containing the length and the Boolean indicating whether its first character is capitalized; this produces a list of tuples, which we unzip to get a tuple of Lists.

scala> val (wlengthsMapUnzip, wcapsMapUnzip) =
|   words.map(word => (word.length, word(0).isUpper)).unzip
wlengthsMapUnzip: List[Int] = List(4, 2, 1, 4, 2, 7, 5, 1)
wcapsMapUnzip: List[Boolean] = List(true, false, false, false, false, true, false, false)

The key thing here is that the map function turns the List[String] words into a List[(Int, Boolean)], which is to say it returns a value. We can assign that value to a variable, or use it immediately by calling unzip on it, which in turn returns a value of type (List[Int], List[Boolean]), that is, a Tuple2 of Lists.

Before moving on let’s define a simple function to display the results of performing this computation, which we will do in various ways (and which all produce precisely the same results).

def display(intro: String, wlengths: List[Int], wcaps: List[Boolean]): Unit = {
  println(intro)
  println("Lengths: " + wlengths.mkString(" "))
  println("Caps: " + wcaps.mkString(" "))
  println()
}

Calling this function with the result of mapping and unzipping as above, we get the following output.

scala> display("Using map and unzip.", wlengthsMapUnzip, wcapsMapUnzip)
Using map and unzip.
Lengths: 4 2 1 4 2 7 5 1
Caps: true false false false false true false false

Okay, so now let’s start doing it the hard way. Rather than mapping over the original list, we’ll loop over the list with foreach and perform side-effect computations that build up the two result sequences. This is the sort of thing that is typically done in non-functional languages with for loops, hence the use of foreach in Scala. There are several variations on this approach, and we’ll explore each of them in turn.

Second variation: use reassignable, immutable Lists

We can use reassignable variables which are initialized to be empty Lists, and then prepend to them as we loop through the words list. We are thus using a variable that has the type of List, which is an immutable sequence data structure, but its value is being reassigned each time we pass through the loop.

var wlengthsReassign = List[Int]()
var wcapsReassign = List[Boolean]()
words.foreach { word =>
  wlengthsReassign = word.length :: wlengthsReassign
  wcapsReassign = word(0).isUpper :: wcapsReassign
}

display("Using reassignable lists.", wlengthsReassign.reverse, wcapsReassign.reverse)

Note that we build up the lists by prepending, which means they come out of the loop in reverse order and thus must be reversed before being displayed. You can of course append to a List by creating a singleton List and concatenating the two Lists with the ::: operator.

scala> val foo = List(4,2)
foo: List[Int] = List(4, 2)

scala> foo ::: List(7)
res0: List[Int] = List(4, 2, 7)

However, this is not recommended because it is computationally costly. Adding an element to the front (left) of a List is a constant time operation, whereas concatenating two lists requires time proportional to the length of the first list. That might not seem like a big deal until you are dealing with lists with thousands of elements, and then you’ll find that the same bit of code that prepends many times and then reverses is much faster than one which appends using the above strategy.
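To make the cost difference concrete, here is a minimal sketch that builds the same list both ways. The names buildByPrepending and buildByAppending are hypothetical helpers introduced only for this illustration; they are not part of the code from this post.

```scala
// Hypothetical helpers, for illustration only.

// Prepend each element (constant time per step), then reverse once at the end.
def buildByPrepending(n: Int): List[Int] = {
  var acc = List[Int]()
  (1 to n).foreach { i => acc = i :: acc }
  acc.reverse
}

// Append each element with :::, which has to walk the whole left-hand
// list on every step, so the total work grows quadratically with n.
def buildByAppending(n: Int): List[Int] = {
  var acc = List[Int]()
  (1 to n).foreach { i => acc = acc ::: List(i) }
  acc
}

val viaPrepend = buildByPrepending(2000)
val viaAppend = buildByAppending(2000)
```

Both calls produce the same list. The difference is only in how much work is done along the way, which should become noticeable if you raise n into the tens of thousands.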

Third variation: use unreassignable, mutable (growable) ListBuffers

Next, we can use a ListBuffer, which is a mutable sequence data structure that also happens to support constant time append operations. We can thus declare it as a val, and then use the method append to mutate the sequence so that it has a new element at the end. So, the variables referring to the sequences are not reassignable, but their values are mutable.

import collection.mutable.ListBuffer
val wlengthsBuffer = ListBuffer[Int]()
val wcapsBuffer = ListBuffer[Boolean]()
words.foreach { word =>
  wlengthsBuffer.append(word.length)
  wcapsBuffer.append(word(0).isUpper)
}

display("Using mutable ListBuffer.", wlengthsBuffer.toList, wcapsBuffer.toList)

Note that we must convert the ListBuffers to Lists for the call to display in order to have the right types as arguments to that function.

Since they can efficiently grow (i.e., get longer), ListBuffers are a good choice for many problems where we need to accumulate a set of results, and especially when we don’t know how many results we will be accumulating. However, if you know the number of results you’ll be accumulating it’s probably better to use Arrays, as shown next.

Fourth variation: use unreassignable, mutable (but fixed length) Arrays

Both of the above alternatives probably look a little strange to people coming from Java. In Java, you’d be more likely to do an imperative solution that involves initializing arrays that have the same length as words and then filling in respective indices as appropriate. To do this, use Array.fill(lengthOfArray)(initialValue).

val wlengthsArray = Array.fill(words.length)(0)
val wcapsArray = Array.fill(words.length)(false)
words.indices.foreach { index =>
  wlengthsArray(index) = words(index).length
  wcapsArray(index) = words(index)(0).isUpper
}

display("Using iteration and arrays.", wlengthsArray.toList, wcapsArray.toList)

We go through the indices and for each one compute the value and assign it to the appropriate index in the corresponding Array. Again, we need to convert the results to Lists before calling display. The indices method does exactly what you’d expect — it gives you the indices of the List.

scala> words.indices
res2: scala.collection.immutable.Range = Range(0, 1, 2, 3, 4, 5, 6, 7)

A problem with the above foreach loop is that it requires indexing into Lists, which is generally a bad idea. Why? Because getting the i-th item from a List takes time proportional to i. The implementation for obtaining the item at index i involves peeling off the head of the list to get its tail, then seeking the (i-1)-th item of the tail, which requires peeling off its head and seeking the (i-2)-th item, and so on. So, if you want to get the 10,000th item in a list, you have to perform about 10,000 operations to get it. If the words list had 10,000 elements, the foreach itself would perform 10,000 basic steps, and at each index the loop looks up words(index) twice, with each lookup costing about index operations. That is roughly 2*index operations per element, or about 20,000 operations for the last index alone.

Note that indexing into Arrays is a constant time operation, so there is no problem with the left hand side of the assignments in the above loop.
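To see why indexing into a List is costly, here is a minimal sketch of what a lookup by position has to do. The function nth is a hypothetical helper written for this illustration; it mirrors the head-and-tail description above and is not the library’s actual implementation.

```scala
// Hypothetical helper: walk down the list one cell at a time until the
// requested position is reached. A sketch, not the library's code.
def nth[A](items: List[A], i: Int): A = items match {
  case Nil => throw new IndexOutOfBoundsException("index past the end of the list")
  case head :: tail => if (i == 0) head else nth(tail, i - 1)
}

val sample = List("This", "is", "a", "list")
val third = nth(sample, 2)   // peels off "This" and "is" before reaching "a"
```

Each recursive call discards one element, so reaching position i takes i steps, whereas an Array can jump straight to the requested slot.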

You might think you can do better by first storing the word and then using it twice, e.g.

words.indices.foreach { index =>
  val word = words(index)
  wlengthsArray(index) = word.length
  wcapsArray(index) = word(0).isUpper
}

This is better, but it only saves us half the operations. Since we were perfectly happy to loop over the words themselves before, we actually shouldn’t have to do this look up — we can do better by having a reassignable counter index that allows us to set values to the correct positions in the new Arrays we are creating.

val wlengthsArray2 = Array.fill(words.length)(0)
val wcapsArray2 = Array.fill(words.length)(false)
var index = 0
words.foreach { word =>
  wlengthsArray2(index) = word.length
  wcapsArray2(index) = word(0).isUpper
  index += 1
}

Since this sort of pattern is fairly common, Scala provides a handy method on sequences called zipWithIndex which returns a List of the original elements paired with their indices.

scala> words.zipWithIndex
res3: List[(java.lang.String, Int)] = List((This,0), (is,1), (a,2), (list,3), (of,4), (English,5), (words,6), (.,7))

In this way, we can have the foreach loop over such pairs. It is convenient in these cases to use the pattern matching abilities in foreach loops by using the case match on pairs, as below.

val wlengthsArray3 = Array.fill(words.length)(0)
val wcapsArray3 = Array.fill(words.length)(false)
words.zipWithIndex.foreach { case(word,index) =>
  wlengthsArray3(index) = word.length
  wcapsArray3(index) = word(0).isUpper
}

It’s important to understand the cost of the operations you are using, especially in looping contexts where you are inherently doing the same basic operation multiple times.

Indexed sequences (Vectors)

It is worth pointing out that when you want an immutable sequence that allows efficient indexing, you should use Vectors.

scala> val bar = Vector(1,2,3)
bar: scala.collection.immutable.Vector[Int] = Vector(1, 2, 3)

If you have a List in hand but want to index into it repeatedly, you can convert it to a Vector using toIndexedSeq.

scala> val numbers = List(4,9,9,2,3,8)
numbers: List[Int] = List(4, 9, 9, 2, 3, 8)

scala> numbers.toIndexedSeq
res5: scala.collection.immutable.IndexedSeq[Int] = Vector(4, 9, 9, 2, 3, 8)

IndexedSeq is a supertype of sequences which are designed to be efficient for indexing, and Vector is the default “backing” implementation when you call toIndexedSeq on a List.
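As a small sketch of this in the running example (the names wordList, wordVector and lengthsByIndex are made up for this illustration), an index-based computation becomes reasonable once the words are held in an indexed sequence:

```scala
val wordList = List("This", "is", "a", "list", "of", "English", "words", ".")

// Convert once, then index as often as needed.
val wordVector = wordList.toIndexedSeq

// Look up each word by position; this is efficient on an indexed sequence.
val lengthsByIndex = wordVector.indices.map(i => wordVector(i).length).toList
```

The conversion itself walks the List once, so it pays off when you index repeatedly rather than just once or twice.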

Of course, if you are only ever going over all the elements of a sequence in order, then Lists are likely to be preferable since they have a bit less overhead and they have some nice properties for pattern matching in match statements.

Using predefined functions for mapping over a sequence

Another thing worth pointing out is that if you have a predefined function, you can pass that as the argument to map, which can lead to very concise code for this task. Assume you have defined a function that takes a String and produces a Tuple of its argument’s length and whether it starts with an upper-case letter.

def getLengthAndUpper(word: String): (Int, Boolean) = (word.length, word(0).isUpper)

The code for mapping over words with this function to get our desired lists is then very clean.

val (wlengthsFunction, wcapsFunction) = words.map(getLengthAndUpper).unzip

Of course, you would probably only do this if you needed that same function in other places. If not, it’s preferable to just use the anonymous function like in the first map example in this blog post. However, you can see that if you have a library of simple functions like this, you can now start writing much clearer and simpler code by reusing them when mapping over different lists.

For expressions

Notice that the previous loops were all foreach ones, whereas Java programmers and Pythonistas will be used to for loops. Scala doesn’t have for loops — it has for expressions. A common question then is: What’s the difference? What is a for expression for and why isn’t it a for loop? The difference is that an expression returns a value, so while foreach allows you to plow through a sequence and do some operation to each element, a for expression allows you to return a value for each element. Consider the following, in which we yield the square of each integer in a List[Int].

scala> val numbers = List(4,9,9,2,3,8)
numbers: List[Int] = List(4, 9, 9, 2, 3, 8)

scala> for (num <- numbers) yield num*num
res6: List[Int] = List(16, 81, 81, 4, 9, 64)

We get a result, whereas a foreach loop just does the computation and returns nothing.

scala> numbers.foreach { num => num * num }

The key is that we yield a value for each element in the for expression. In this case, it is basically equivalent to using map. Here it is in the context of the running words example.

val (wlengthsFor, wcapsFor) =
  (for (word <- words) yield (word.length, word(0).isUpper)).unzip

display("Using a for expression.", wlengthsFor, wcapsFor)
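The claim that a for expression with yield is basically equivalent to map can be checked directly. The following is a minimal sketch; the names ints, squaresFor, squaresMap and discarded are chosen just for this illustration.

```scala
val ints = List(4, 9, 9, 2, 3, 8)

// A for expression with yield...
val squaresFor = for (num <- ints) yield num * num

// ...gives the same result as the corresponding map.
val squaresMap = ints.map(num => num * num)

// foreach, by contrast, has result type Unit: the squares are
// computed and then thrown away.
val discarded: Unit = ints.foreach { num => num * num }
```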

Having said all this, it turns out you can use a for expression as a loop, without returning any values, e.g. as follows.

scala> for (num <- numbers) { println(num*num) }
16
81
81
4
9
64

I think it is generally better to use a foreach loop for such cases so that it is clear that you are only performing side-effects, like printing, reassigning the values of var variables, or modifying mutable data structures. However, there are some cases where a for expression can be more convenient, for example when working through multiple lists and doing various filtering operations. Here’s a quick example to give a flavor of this. Given two lists, we can enumerate the cross product of all their elements.

scala> val numbers = List(4,9,9,2,3,8)
numbers: List[Int] = List(4, 9, 9, 2, 3, 8)

scala> val letters = List('a','C','f','d','z')
letters: List[Char] = List(a, C, f, d, z)

scala> for (n <- numbers; l <- letters) print("(" + n + "," + l + ") ")
(4,a) (4,C) (4,f) (4,d) (4,z) (9,a) (9,C) (9,f) (9,d) (9,z) (9,a) (9,C) (9,f) (9,d) (9,z) (2,a) (2,C) (2,f) (2,d) (2,z) (3,a) (3,C) (3,f) (3,d) (3,z) (8,a) (8,C) (8,f) (8,d) (8,z)

You can filter on these values as well to restrict the output to a reduced set of elements of interest.

scala> for (n <- numbers; if (n>4); l <- letters) print("(" + n + "," + l + ") ")
(9,a) (9,C) (9,f) (9,d) (9,z) (9,a) (9,C) (9,f) (9,d) (9,z) (8,a) (8,C) (8,f) (8,d) (8,z)
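The same generators and filter can also be used with yield, so that the pairs are returned as a value instead of being printed. The sketch below also spells out roughly what such a for expression amounts to in terms of ordinary method calls; the names nums, chars, pairsFor and pairsExplicit are made up for this illustration.

```scala
val nums = List(4, 9, 9, 2, 3, 8)
val chars = List('a', 'C', 'f', 'd', 'z')

// Collect the filtered cross product as a value instead of printing it.
val pairsFor = for (n <- nums; if n > 4; c <- chars) yield (n, c)

// Roughly the same computation written with explicit method calls:
// filter the numbers, then pair each surviving number with every letter.
val pairsExplicit = nums.withFilter(n => n > 4).flatMap(n => chars.map(c => (n, c)))
```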

There is much more to this, but I’ll leave it here since using for expressions in this way is a rich enough topic for several blog posts in and of itself. Also, there is a detailed discussion of it in Odersky, Spoon, and Venners’ book “Programming in Scala.”

Fifth variation: use a recursive function

It’s worth pointing out one other way of building up the lengths and caps lists. Recursive functions are functions which look at their input and then either return a result for a base case or compute a partial result and call themselves with it. It’s pretty standard stuff that computer scientists love and which tends to get used a lot more in functional programming than in imperative programming. Here, I’ll show how to do the same task as before using recursion, but without an in-depth explanation. If you already know recursion, you can see how it looks in Scala for the same problem as above. If you don’t know much about recursion, you can use this as an example of how it is employed for a task you already understand from the vantage points given above; in that case, hopefully it will be useful in conjunction with other tutorials on recursion.

First, we need to define the recursive function, given below. It has three parameters: one for the list of words, one for the already computed lengths and another for the already computed caps. It returns a pair that has first the list of lengths with one additional item prepended to it and then the list of caps values with one additional item prepended to it. The items being prepended are computed from the head of the inputWords list.

def lengthCapRecursive(
  inputWords: List[String],
  lengths: List[Int],
  caps: List[Boolean]): (List[Int], List[Boolean]) = inputWords match {

  case Nil =>
    (lengths, caps)
  case head :: tail =>
    lengthCapRecursive(tail, head.length :: lengths, head(0).isUpper :: caps)
}
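Calling a function of this shape directly on a short list shows that the accumulated results come out in reverse order. The sketch below repeats the same logic under the hypothetical name accumulate so that it runs on its own.

```scala
// Same logic as the recursive function above, under a hypothetical name.
def accumulate(
  inputWords: List[String],
  lengths: List[Int],
  caps: List[Boolean]): (List[Int], List[Boolean]) = inputWords match {

  case Nil =>
    (lengths, caps)
  case head :: tail =>
    accumulate(tail, head.length :: lengths, head(0).isUpper :: caps)
}

// "This" is processed first, so its length ends up last in the result.
val (rawLengths, rawCaps) = accumulate(List("This", "is", "a"), Nil, Nil)
```

Here rawLengths holds the lengths for "a", "is" and "This" in that order, which is why a final reverse is needed to line the results up with the input.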

We can call this function directly, but it is often convenient to provide a secondary function that makes the initial call to this function with empty result lists as the second and third parameters. The secondary function can then perform the reversal and return the desired computed lists.

def lengthCapRecursive(inputWords: List[String]): (List[Int], List[Boolean]) = {
  val (l,c) = lengthCapRecursive(inputWords, List[Int](), List[Boolean]())
  (l.reverse, c.reverse)
}

Getting the result is then just a matter of calling that function with our words list.

val (wlengthsRecursive, wcapsRecursive) = lengthCapRecursive(words)

display("Using a recursive function.", wlengthsRecursive, wcapsRecursive)

A slight variation on this that is slightly cleaner is to “hide” the recursive function inside the secondary function, which then effectively acts as a wrapper to the recursive function. This is often considered cleaner because the programmer can ensure that the initialization is done correctly and that the recursive function itself isn’t given malformed inputs.

def lengthCapRecurWrap(inputWords: List[String]): (List[Int], List[Boolean]) = {

  // This helper is visible only inside lengthCapRecurWrap.
  def lengthCapRecurHelp(
    inputWords: List[String],
    lengths: List[Int],
    caps: List[Boolean]): (List[Int], List[Boolean]) = inputWords match {

    case Nil =>
      (lengths, caps)
    case head :: tail =>
      lengthCapRecurHelp(tail, head.length :: lengths, head(0).isUpper :: caps)
  }

  val (l,c) = lengthCapRecurHelp(inputWords, List[Int](), List[Boolean]())
  (l.reverse, c.reverse)

}

val (wlengthsRecurWrap, wcapsRecurWrap) = lengthCapRecurWrap(words)

display("Using a recursive function contained in a wrapper.", wlengthsRecurWrap, wcapsRecurWrap)

Conclusion

So, that provides an overview of different ways of obtaining the same results, along with some explanation of the computational properties of each solution. These are considerations that are likely to crop up in your own code, so you should be aware of them.

Clearly there are many ways of getting the same thing done in Scala. This can be hard for newcomers to the language since they don’t have good intuitions about which approach is better in different circumstances, but it is quite valuable to have these options as you become more savvy and understand what the costs and benefits of using different data structures and different ways of iterating are.

All of the code from the above snippets is gathered together in the GitHub gist ListComputations.scala. You can save it as a file and run it as “scala ListComputations.scala” to see the output and play around with modifications to the code.

It’s Christmas day, and I thought I’d share a wee thing of interest that came my way this week. I was a graduate student at the University of Edinburgh in Scotland from 1999 to 2002, and during that time I purchased a great deal of my clothes at thrift shops around the town — it definitely made the pounds/dollars stretch quite a bit more. In addition to clothes, these shops sold used books, and I sometimes couldn’t resist picking one or two up. One of those was Watership Down, a children’s book about rabbits at war (really!) that I had loved as a kid. I happened to pick it up from my bookshelf the other day. I leafed through it, curious to see how long I might have to wait until I can read it to my now two-and-a-half year old son. Inside, I found a bit of history in the form of the following raffle ticket for the Edinburgh Wanderers rugby union team, from 1978:

Edinburgh Wanderers 1978 Raffle Ticket

It turns out that the Edinburgh Wanderers is no more: they merged with Murrayfield RFC to become the Murrayfield Wanderers RFC in 1997. They play at Murrayfield Stadium, which I biked past from time to time on my way to rent cars to take out into the Scottish Highlands (one of the many great things about living and studying in Edinburgh).

I love the second and third prizes, and the “etc., etc.” reduplication. What is a “giant food hamper” anyway?! Exactly how big is “giant” in that context? And a gallon of whisky doesn’t sound like a bad third prize, depending on which distillery made it. It might not have been too shabby, given that 100 pounds in 1978 would be worth 450 to 750 pounds today. Some Caol Ila, perhaps?

I can only presume that this ticket was not a winner and was thus relegated to being a bookmark… I’ll be happy to keep using it as such for myself now.

With that, I simply say: a very Merry Christmas and a Grand Christmas Draw to all!

Topics: code blocks, coding style, closures, scala documentation project

Preface

This is part 12 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

This post isn’t so much a tutorial as a comment on coding style with a few pointers on how code blocks in Scala work. It was instigated by patterns I was noting in my students’ code; namely, that they were packing everything into one-liners with map after map after map. These map-over-mapValues-over-map sequences of statements can be almost incomprehensible, both for another person reading the code and even for the person writing it. I do admit to a fair amount of guilt in using such sequences of operations in class lectures and even in some of these tutorials. It works well in the REPL and when you have lots of text to explain what is going on around the piece of code in question, but it seems to have given a bad model for writing actual code. Oops!

So taking a step back, it is important to break operation sequences up a bit, but it isn’t always obvious to beginners how one can do so. Also, some students indicated that they had gotten the impression that one should try to pack everything onto one line if possible, and that breaking things up was somehow less advanced or less Scala-like. This is hardly the case. In fact much to the contrary: it is crucial to use strategies that allow readers of your code to see the logic behind your statements. This isn’t just for others — you are likely to be a reader of your own code, often months after you originally wrote it, and you want to be kind to your future self.

A simple example

I’m giving an example here of what you can do to give your code more breathing space. It’s not a very meaningful example, but it serves the purpose without being very complex. We begin by creating a list of all the letters in the alphabet.


scala> val letters = "abcdefghijklmnopqrstuvwxyz".toList.map(_.toString)
letters: List[java.lang.String] = List(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z)

Okay, now here’s our (pointless) task: we want to create a map from every letter (from ‘a’ to ‘x’) to a list containing that letter and the two letters that follow it in reverse alphabetical order. (Did I mention this was a pointless task in and of itself?) Here’s a one-liner that can do it.


scala> letters.zip((1 to 26).toList.sliding(3).toList).toMap.mapValues(_.map(x => letters(x-1)).sorted.reverse)
res0: scala.collection.immutable.Map[java.lang.String,List[java.lang.String]] = Map(e -> List(g, f, e), s -> List(u, t, s), x -> List(z, y, x), n -> List(p, o, n), j -> List(l, k, j), t -> List(v, u, t), u -> List(w, v, u), f -> List(h, g, f), a -> List(c, b, a), m -> List(o, n, m), i -> List(k, j, i), v -> List(x, w, v), q -> List(s, r, q), b -> List(d, c, b), g -> List(i, h, g), l -> List(n, m, l), p -> List(r, q, p), c -> List(e, d, c), h -> List(j, i, h), r -> List(t, s, r), w -> List(y, x, w), k -> List(m, l, k), o -> List(q, p, o), d -> List(f, e, d))

That did it, but that one-liner isn’t clear at all, so we should break things up a bit. Also, what is “_” and what is “x”? (By which I mean, what are they in terms of the logic of the program? We know they are ways of referring to the elements being mapped over, but they don’t help the human reading the code understand what is going on.)

Let’s start by creating the sliding list of number ranges.


scala> val ranges = (1 to 26).toList.sliding(3).toList
ranges: List[List[Int]] = List(List(1, 2, 3), List(2, 3, 4), List(3, 4, 5), List(4, 5, 6), List(5, 6, 7), List(6, 7, 8), List(7, 8, 9), List(8, 9, 10), List(9, 10, 11), List(10, 11, 12), List(11, 12, 13), List(12, 13, 14), List(13, 14, 15), List(14, 15, 16), List(15, 16, 17), List(16, 17, 18), List(17, 18, 19), List(18, 19, 20), List(19, 20, 21), List(20, 21, 22), List(21, 22, 23), List(22, 23, 24), List(23, 24, 25), List(24, 25, 26))

It’s quite clear what that is now. (The sliding function is a beautiful thing, especially for natural language processing problems.)
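As one small illustration of why sliding is handy for language processing, here is a sketch that extracts bigrams (pairs of adjacent words) from a token list. The tokens and the name bigrams are made up for this example.

```scala
val tokens = List("the", "man", "is", "a", "stick")

// Each window of two adjacent tokens is a bigram.
val bigrams = tokens.sliding(2).map(window => window.mkString(" ")).toList
```

This gives "the man", "man is", "is a" and "a stick", in that order.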

Next, we zip the letters with the ranges and create a Map from the pairs using toMap. This produces a Map from letters to lists of three numbers. Note that the lengths of the two lists are different: letters has 26 elements and ranges has 24, which means that the last two elements of letters (‘y’ and ‘z’) get dropped in the zipped list.


scala> val letter2range = letters.zip(ranges).toMap
letter2range: scala.collection.immutable.Map[java.lang.String,List[Int]] = Map(e -> List(5, 6, 7), s -> List(19, 20, 21), x -> List(24, 25, 26), n -> List(14, 15, 16), j -> List(10, 11, 12), t -> List(20, 21, 22), u -> List(21, 22, 23), f -> List(6, 7, 8), a -> List(1, 2, 3), m -> List(13, 14, 15), i -> List(9, 10, 11), v -> List(22, 23, 24), q -> List(17, 18, 19), b -> List(2, 3, 4), g -> List(7, 8, 9), l -> List(12, 13, 14), p -> List(16, 17, 18), c -> List(3, 4, 5), h -> List(8, 9, 10), r -> List(18, 19, 20), w -> List(23, 24, 25), k -> List(11, 12, 13), o -> List(15, 16, 17), d -> List(4, 5, 6))

Note that we could have broken this into two steps, first creating the zipped list and then calling toMap on it. However, it is perfectly clear what the intent is when one zips two lists (creating a list of pairs) and then uses toMap on it immediately, so this is certainly a case where it makes sense to put multiple operations on a single line.
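The truncation that zip performs on lists of unequal length is easier to see on a smaller pair of lists. This is a minimal sketch; fourLetters, twoRanges, zipped and letterToRange are names invented for the illustration.

```scala
val fourLetters = List("a", "b", "c", "d")
val twoRanges = List(List(1, 2, 3), List(2, 3, 4))

// zip stops as soon as the shorter list runs out, so "c" and "d" are dropped.
val zipped = fourLetters.zip(twoRanges)

// toMap then turns the remaining pairs into key-value entries.
val letterToRange = zipped.toMap
```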

At this point we could of course process the letter2range Map using a one-liner.


scala> letter2range.mapValues(_.map(x => letters(x-1)).sorted.reverse)
res1: scala.collection.immutable.Map[java.lang.String,List[java.lang.String]] = Map(e -> List(g, f, e), s -> List(u, t, s), x -> List(z, y, x), n -> List(p, o, n), j -> List(l, k, j), t -> List(v, u, t), u -> List(w, v, u), f -> List(h, g, f), a -> List(c, b, a), m -> List(o, n, m), i -> List(k, j, i), v -> List(x, w, v), q -> List(s, r, q), b -> List(d, c, b), g -> List(i, h, g), l -> List(n, m, l), p -> List(r, q, p), c -> List(e, d, c), h -> List(j, i, h), r -> List(t, s, r), w -> List(y, x, w), k -> List(m, l, k), o -> List(q, p, o), d -> List(f, e, d))

This is better than what we started with because we at least know what letter2range is, but it still isn’t clear what is going on after that. To make this more comprehensible, we can break it up over multiple lines and give more descriptive names to the variables. The following produces the same result as above.


letter2range.mapValues (
  range => {
    val alphavalues = range.map (number => letters(number-1))
    alphavalues.sorted.reverse
  }
)

Notice that:

  • I called it range rather than _, which is a better indicator of what mapValues is working with.
  • After the => I use an opening curly brace {.
  • The next lines are a block of code that I can use like any block of code, which means I can create variables and break things down into smaller, more understandable steps. For example, the line creating alphavalues makes it clear that we are taking a range and mapping it to the letters at the corresponding indices in the letters list (e.g., the range 2, 3, 4 becomes ‘b’, ‘c’, ‘d’). We then sort and reverse that list (okay, so it started out sorted, but you can imagine plenty of times you need to do such sorting).
  • The last line of that block is the result of the overall mapValues operation for that element (here, indicated by the variable range).
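The same points apply to any function passed as a block, not just to mapValues. Here is a small self-contained sketch using plain map over a Map; the data and the names groups and longestPerGroup are invented for the illustration.

```scala
val groups = Map("short" -> List("a", "of"), "long" -> List("English", "words"))

val longestPerGroup = groups.map { case (label, members) =>
  // Intermediate steps get their own descriptive names...
  val lengths = members.map(word => word.length)
  val longest = lengths.max
  // ...and the last expression is the value produced for this entry.
  (label, longest)
}
```

The result maps "short" to 2 and "long" to 7, and each step inside the block can be read and checked on its own.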

Basically, we get a lot more breathing room, and this becomes even more essential as you dig deeper or do more complex operations during a map-within-a-map operation. Having said that, you should ask yourself whether you should just create and use a function that has a clear semantics and does the job for you. For example, here’s an alternative to the above strategy that is perhaps clearer.


def lookupSortAndReverse (range: List[Int], alpha: List[String]) =
  range.map(number => alpha(number-1)).sorted.reverse

We’ve defined a function that takes a range and a list of letters (called alpha in the function) and produces the sorted and reversed list of letters corresponding to the numbers in the range. In other words, it does what the anonymous function defined after range in the previous code block did. We can thus easily use it in the top-level mapValues operation with completely clear intent and comprehensibility.

letter2range.mapValues(range => lookupSortAndReverse(range, letters))

Of course, you should especially consider creating such functions if you use the same operation in multiple places.

Closures

One final note. I passed the letters list into the lookupSortAndReverse function so that its value was bound to the function-internal variable alpha. You may wonder whether I needed to include that, or whether it is possible to directly access the letters list in the function. In fact you can: provided that letters has already been defined, we can do the following.

def lookupSortAndReverseCapture (range: List[Int]) =
  range.map(number => letters(number-1)).sorted.reverse

letter2range.mapValues(range => lookupSortAndReverseCapture(range))

This is called a closure, meaning that the function has incorporated free variables (here, letters) that come from outside its own scope. I generally don’t use this strategy with named functions like this, but there are many natural situations for using closures. In fact you do it all the time when you are creating anonymous functions as arguments to functions like map and mapValues and their cousins. As a reminder, here was the map-within-a-mapValues anonymous function we defined before.

letter2range.mapValues (
  range => {
    val alphavalues = range.map (number => letters(number-1))
    alphavalues.sorted.reverse
  }
)

The letters variable has been “closed over” in the anonymous function range => { … }, which is not very different from what we did with the closure-style lookupSortAndReverseCapture function.
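Here is one more minimal sketch of the same idea in a different setting, showing a named function and an anonymous function that both close over a value defined outside them. The names vowels, countVowels and vowelCounts are hypothetical and used only for this illustration.

```scala
val vowels = Set('a', 'e', 'i', 'o', 'u')

// Named function that closes over vowels rather than taking it as a parameter.
def countVowels(word: String): Int = word.count(ch => vowels.contains(ch.toLower))

// Anonymous function that closes over the same value.
val vowelCounts = List("English", "words", "a").map(word => word.count(ch => vowels.contains(ch.toLower)))
```

In both cases vowels is a free variable inside the function body, and it must already be defined at the point where the function is used.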

All the code in one spot

Since there are some dependencies between the different steps in this tutorial that could get things mixed up, here’s all the code in one spot such that you can run it easily.


// Get a list of the letters
val letters = "abcdefghijklmnopqrstuvwxyz".toList.map(_.toString)

// Now create a map that takes each letter to a list containing itself
// and the two letters after it, in reverse alphabetical
// order. (Bizarre, but hey, it's a simple example. BTW, we lose y and
// z in the process.)

letters.zip((1 to 26).toList.sliding(3).toList).toMap.mapValues(_.map(x => letters(x-1)).sorted.reverse)

// Pretty unintelligible. Let's break things up a bit

val ranges = (1 to 26).toList.sliding(3).toList
val letter2range = letters.zip(ranges).toMap
letter2range.mapValues(_.map(x => letters(x-1)).sorted.reverse)

// Okay, that's better. But it is easier to interpret the latter if we break things up a bit

letter2range.mapValues (
  range => {
    val alphavalues = range.map (number => letters(number-1))
    alphavalues.sorted.reverse
  }
)

// We can also do the one-liner coherently if we have a helper function.

def lookupSortAndReverse (range: List[Int], alpha: List[String]) =
  range.map(number => alpha(number-1)).sorted.reverse

letter2range.mapValues(range => lookupSortAndReverse(range, letters))

// Note that we can "capture" the letters value, though this
// requires letters to be defined before lookupSortAndReverseCapture
// in the program.

def lookupSortAndReverseCapture (range: List[Int]) =
  range.map(number => letters(number-1)).sorted.reverse

letter2range.mapValues(range => lookupSortAndReverseCapture(range))

Wrapup

Hopefully this will encourage you to use a clearer coding style and has demonstrated some aspects of code blocks that you may not have realized. However, this just scratches the surface of writing clearer code, and a lot of it will just come with time and practice and realizing how necessary it is when you look back at code you wrote months ago.

Note that one easy thing you can do to create better code is to try to stick to established coding conventions. For example, see the coding guidelines for Scala on the Scala documentation project. There is also a lot of other very useful stuff, including tutorials, and it is actively evolving and growing!

Copyright 2011 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

Topics: SBT, scalabha, packages, build systems

Preface

This is part 11 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

This tutorial gives an introduction to building Scala applications using SBT (the Simple Build Tool). This will be done in the context of the Scalabha package, which I have created primarily for my Introduction to Computational Linguistics class. Some supporting code is available in Scalabha for some basic natural language processing tasks; most relevant at the moment is the code in Scalabha that supports the part-of-speech tagging homework for the class.

The previous tutorial showed how Scala code can be compiled with scalac and then run with scala. One problem we ended up with is that there were generated class files littering the working directory. Another thing we did not discuss is how a large system can be created in a modular way that organizes code and classes. For example, you might want to have code in different directories generate classes that can be used by one another. You also may want to incorporate classes from other libraries into your own code. The solutions we’ll discuss to address these needs and more are build systems and packages.

Note: The tutorial assumes you are using some version of Unix. If you are on Windows, you should consider using Cygwin, or you could dual boot your computer.

Note: In this tutorial, I’ll assume you are using a simple text editor to modify files. However, note that the general setup you are working with here can be used from more powerful Integrated Development Environments (IDEs) like Eclipse, IntelliJ, and NetBeans.

Setting up Scalabha

We’ll work with SBT, which is perhaps the most popular build tool for Scala.  The Scalabha toolkit mentioned earlier uses SBT (version 0.11.0), so we’ll discuss SBT in the Scalabha context.

The first thing you need to do is download Scalabha v0.1.1. Next, unzip the file, change to the directory it unpacked to, and list the directory contents.

$ unzip scalabha-0.1.1-src.zip
Archive:  scalabha-0.1.1-src.zip
<lots of output>
$ cd scalabha-0.1.1
$ ls
CHANGES.txt README      build.sbt   project
LICENSE     bin         data        src

Briefly, these contents are:

  • README: A text file describing how to install Scalabha on your machine.
  • LICENSE: A text file giving the license, which is the Apache Software License 2.0.
  • CHANGES.txt: A text file describing the modifications made for each version (not much so far).
  • build.sbt: A text file that contains instructions for SBT regarding how to build Scalabha
  • bin: A directory that contains the scalabha script, which will be used to run applications developed within the Scalabha build system and also to run SBT itself. It also contains sbt-launch-0.11.0.jar, which is a bottled up package of SBT’s classes that will allow us to use SBT very easily. There are some other files that are Perl scripts that are relevant for a research project and aren’t important here.
  • data: A directory containing part-of-speech tagged data for English and Czech that forms the basis for the fourth homework of my Introduction to Computational Linguistics course this semester.
  • project: A directory containing a single file “plugins.sbt” which tells SBT to use the Assembly plugin. More on this later.
  • src: The most important directory of all — it contains the source code of the Scalabha system, and is where you’ll be adding some code as you work with SBT.

At this point you should read the README and get Scalabha set up on your computer, including building the system from source. In this tutorial, I will give some extra details on using SBT and code development with it, complementing and extending the brief information given in the README.

Note that I will refer to the environment variable SCALABHA_DIR below. As specified in the README, you should set this variable’s value to be where you unpacked Scalabha. For example, for me this directory is ~/devel/scalabha.

Tip: to make it so that you don’t have to set your environment variables every time you open a new shell, you can set environment variables in your ~/.profile (Mac, Cygwin) or ~/.bash_aliases (Ubuntu) files. For example, this is in my profile files on my machines.

export SCALABHA_DIR=$HOME/devel/scalabha
export PATH=$PATH:$SCALABHA_DIR/bin

SBT: The Simple Build Tool

This is not a tutorial about setting up a project to use SBT — it is simply about how to use a project that is already set up for SBT. So, if you are looking for resources about learning SBT, what you’ll mainly find are resources to help programmers configure SBT for their project. These will likely confuse you (the Simple Build Tool is not so simple any more, when it comes to configuration). Using it is straightforward, but the kind of know-how that experienced coders have with using something like SBT is what you probably won’t find much help on. Here, I intend to give the basics so that you have a better starting point for doing more with SBT.

First off, there is a bit of sleight of hand with Scalabha that could be confusing. Rather than having users install SBT themselves, I have put the jar file for SBT in the bin directory of Scalabha; then, the scalabha executable (in that same directory) can pick that up and use it to run SBT. (My students and I have set up a number of Scala/Java projects in this way, including Fogbow, Junto, Textgrounder, and Updown.) The scalabha executable has a number of execution targets (more on this later), and one of these is “build”. When you call scalabha’s build target, it invokes SBT and drops you into the SBT interface.

Do the following, in your SCALABHA_DIR.

$ scalabha build
[info] Loading project definition from /Users/jbaldrid/devel/scalabha/project
[info] Set current project to Scalabha (in build file:/Users/jbaldrid/devel/scalabha/)
>

You could have achieved the same by downloading SBT and running it according to the instructions for SBT, but this setup saves you that trouble and ensures that you get the right version of SBT. It is just worth pointing out so that you don’t think that Scalabha is SBT –  SBT is entirely independent of Scalabha.

If you have had any trouble with the Scalabha setup, you can create an issue on the Scalabha Bitbucket site. That just means that I’ll get a notice that you had some problems and can hopefully help you out. And, it is possible that someone else will have had the same problem, in which case you might find your answer there. Most of the problems with this sort of setup are due to confusions about environment variables and unfamiliarity with command line tools.

Compiling with SBT

Let’s actually do something with SBT now. If you successfully got through the README, you will have already done what is next, but I’ll give some more details about what is going on.

Because you may have run some SBT actions already as part of doing the README, start out by running the “clean” action so that we’re on the same page.

> clean
[success] Total time: 0 s, completed Oct 26, 2011 10:18:08 AM

Then, run the “compile” action.

> compile
[info] Updating {file:/Users/jbaldrid/devel/scalabha/}default-86efd0...
[info] Done updating.
[info] Compiling 13 Scala sources to /Users/jbaldrid/devel/scalabha/target/classes...
[success] Total time: 9 s, completed Oct 26, 2011 10:18:19 AM

In another shell (which means another command line window), go to SCALABHA_DIR and list the contents of the directory. You’ll see that two new directories have been created, lib_managed and target. The first is where other libraries have been downloaded from the internet and placed into the Scalabha project space so that they can be easily used; don’t worry about this for the time being. The second is where the compiled class files have gone. To see some example class files, do the following.

$ ls target/classes/opennlp/scalabha/postag/
BaselineTagger$$anonfun$tag$1.class
BaselineTagger.class
EnglishTagInfo$$anonfun$zipWithTag$1$1.class
<... many more class files ...>
RuleBasedTagger$$anonfun$tag$2.class
RuleBasedTagger$$anonfun$tagWord$1.class
RuleBasedTagger.class

These were generated from the following source files.

$ ls src/main/scala/opennlp/scalabha/postag/
HmmTagger.scala PosTagger.scala

Open up PosTagger.scala in a text editor and look at it; you’ll see the class and object definitions that were the sources for the generated class files in the target/classes directory. Basically, SBT has conveniently handled the separation of source files and compiled class files so that we don’t have the class files littering our work space.

How does SBT know where the source files are? Simple: it is configured to look at src/main/scala and compile every .scala file it finds under that directory. In just a bit, you’ll start adding your own Scala files and be able to compile and run them as part of the Scalabha build system.

Next, at the SBT prompt, invoke the “package” action.

> package
[info] Updating {file:/Users/jbaldrid/devel/scalabha/}default-86efd0...
[info] Done updating.
[info] Packaging /Users/jbaldrid/devel/scalabha/target/scalabha-0.1.1.jar ...
[info] Done packaging.
[success] Total time: 0 s, completed Oct 26, 2011 10:19:02 AM

In the shell prompt that we used to list files previously, list the contents of the target directory.

$ ls target/
cache              classes            scalabha-0.1.1.jar streams

You have just created scalabha-0.1.1.jar, a bottled up version of the Scalabha code that others could use in their own libraries. The extension “jar” stands for Java Archive, and it is basically just a zipped up collection of a bunch of class files.
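To make the "zipped up collection of class files" idea concrete, here is a minimal sketch that lists what is inside a jar using the java.util.jar classes from the standard Java library. The object name JarPeek and its two helper functions are made up for this illustration and are not part of Scalabha. If you run it without an argument, it builds a tiny throwaway jar with placeholder entries so there is something to list.

```scala
import java.io.{File, FileOutputStream}
import java.util.jar.{JarEntry, JarFile, JarOutputStream}
import scala.collection.mutable.ListBuffer

// JarPeek is a hypothetical helper for illustration only.
object JarPeek {

  // Write a small jar whose entries have the given names and dummy contents.
  def writeJar(jar: File, names: List[String]): Unit = {
    val out = new JarOutputStream(new FileOutputStream(jar))
    try {
      names.foreach { name =>
        out.putNextEntry(new JarEntry(name))
        out.write("placeholder".getBytes("UTF-8"))
        out.closeEntry()
      }
    } finally out.close()
  }

  // Return the names of all entries stored in the jar.
  def entryNames(jar: File): List[String] = {
    val jarFile = new JarFile(jar)
    try {
      val names = ListBuffer[String]()
      val entries = jarFile.entries()
      while (entries.hasMoreElements()) names += entries.nextElement().getName()
      names.toList
    } finally jarFile.close()
  }

  def main(args: Array[String]): Unit = {
    // With an argument, inspect that jar; otherwise build a demo jar first.
    val jar =
      if (args.length > 0) new File(args(0))
      else {
        val tmp = File.createTempFile("demo", ".jar")
        tmp.deleteOnExit()
        writeJar(tmp, List("opennlp/demo/Hello.class", "opennlp/demo/Hello$.class"))
        tmp
      }
    entryNames(jar).foreach(println)
  }
}
```

Once compiled, passing the path target/scalabha-0.1.1.jar as the argument should print one line per class file (plus any metadata entries) packed into the jar you just built.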

Scalabha itself uses a number of supporting libraries produced by others. To see the jars that are used as supporting libraries by Scalabha, do the following.

$ ls lib_managed/jars/*/*/*.jar
lib_managed/jars/jline/jline/jline-0.9.94.jar
lib_managed/jars/junit/junit/junit-3.8.1.jar
lib_managed/jars/org.apache.commons/commons-lang3/commons-lang3-3.0.1.jar
lib_managed/jars/org.clapper/argot_2.9.1/argot_2.9.1-0.3.5.jar
lib_managed/jars/org.clapper/grizzled-scala_2.9.1/grizzled-scala_2.9.1-1.0.8.jar
lib_managed/jars/org.scalatest/scalatest_2.9.0/scalatest_2.9.0-1.6.1.jar

Of course, you may still be wondering what it means to “use a library” in your code. More on this after we talk about packages and actually start doing some code ourselves.

Packages

Projects with a lot of code are generally organized into a package that has a set of sub-packages for parts of the code base that work closely together. At the very high level, a package is simply a way to ensure that we have unique fully qualified names for classes. For example, there is a class called Range in the Apache Commons Lang library and in the core Scala library. If you want to use both of these classes in the same piece of code, there is an obvious problem of a name conflict. Fortunately, they are contained within packages that allow us to refer to them uniquely.

  • Range in the Apache Commons Lang library is org.apache.commons.lang3.Range
  • Range in Scala is scala.collection.immutable.Range

So, when we do need to use them together, we are still able to do so without conflict. You’ve actually already seen some package names before, for example with java.lang.String and the distinction between scala.collection.mutable.Map and scala.collection.immutable.Map.
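As a small sketch of how two classes with the same short name can be used side by side, the following uses the two Map classes just mentioned. One is referred to by its fully qualified name and the other through a renaming import. The object name PackageNames and the function countWords are made up for this example.

```scala
// Give the mutable Map a distinct local name so it cannot be confused
// with the immutable Map that Scala makes available by default.
import scala.collection.mutable.{Map => MutableMap}

// PackageNames is a hypothetical object for illustration only.
object PackageNames {

  // Count words with a mutable map, then hand back an immutable copy.
  // The return type is spelled out with its fully qualified name.
  def countWords(words: List[String]): scala.collection.immutable.Map[String, Int] = {
    val counts = MutableMap[String, Int]()
    words.foreach(word => counts(word) = counts.getOrElse(word, 0) + 1)
    counts.toMap
  }

  def main(args: Array[String]): Unit = {
    val counts = countWords(List("the", "man", "saw", "the", "dog"))
    println(counts("the")) // prints 2
  }
}
```

The same two techniques (fully qualified names and renaming imports) apply to any pair of classes that share a short name across different packages.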

To see the packages and classes in Scalabha, run the “doc” action in SBT.

> doc
[info] Generating API documentation for main sources...
model contains 35 documentable templates
[info] API documentation generation successful.
[success] Total time: 7 s, completed Oct 26, 2011 10:22:23 AM

Now, point your browser to the file target/api/index.html. Note: this means doing “open file” and then going to your SCALABHA_DIR and then to target, then to api, and then selecting index.html. You can then browse the packages and classes in Scalabha. For example, look at HmmTagger, which is in the package opennlp.scalabha.postag, and you’ll see some of the fields and functions that are made available by that class.

But, you may still be wondering: how do I use these packages and classes in my code anyway? We do so via import statements. We’ll explore this by creating our own source code and compiling it.

Creating and compiling new code in SBT

First, we’ll begin by just doing a simple hello world application that is done in the context of Scalabha and uses a package name. Get set up for this by doing the following set of commands.

$ cd $SCALABHA_DIR
$ cd src/main/scala/opennlp/
$ mkdir bcomposes

Next, using a text editor, create the file Hello.scala in the src/main/scala/opennlp/bcomposes directory with the following contents.

package opennlp.bcomposes

object Hello {
  def main (args: Array[String]) = println("Hello, world!")
}

This is just like the hello world object from the previous tutorial, but now it has the additional package specification that indicates that its fully qualified name is opennlp.bcomposes.Hello.

Because the source code for Hello.scala is in a sub-directory of the src/main/scala directory, we can now compile this file using SBT. Make sure to save Hello.scala, and then go back to your SBT prompt and type “compile”.

> compile
[info] Compiling 1 Scala source to /Users/jbaldrid/devel/scalabha/target/classes...
[success] Total time: 1 s, completed Oct 26, 2011 10:35:15 AM

Notice that it compiled just one Scala source: SBT has already compiled the other source files in Scalabha, so it only had to compile the new one that you just saved.

Having successfully created and compiled the opennlp.bcomposes.Hello object, we can now run it. The scalabha executable provides a “run” target that allows you to run any of the code you’ve produced in the Scalabha build setup. In your shell, type the following.

$ scalabha run opennlp.bcomposes.Hello
Hello, world!

There is actually a bunch of stuff going on under the hood that ensures that your new class is included in the CLASSPATH and can be used in this manner (see bin/scalabha for details). This will simplify things for you considerably. To make a long story short, getting the CLASSPATH appropriately set is one of the main points of confusion for new developers; this way you can keep on moving without having to worry about what is essentially a plumbing problem.

Now, let’s say you want to change the definition of the Hello object to also print out an additional message that is supplied on the command line. Modify the main method to look like this.

def main (args: Array[String]) {
  println("Hello, world!")
  println(args(0))
}

Now save it, and try running it.

$ scalabha run opennlp.bcomposes.Hello Goodbye
Hello, world!

Oops, it didn’t work?! I’ve just forced you directly into a common point of confusion for students who are switching from scripting to compiling: after changing the source, you must compile it again before the change takes effect, because the run target uses the class files from the last successful compile. So, invoke compile in SBT, and then try that command again.

$ scalabha run opennlp.bcomposes.Hello Goodbye
Hello, world!
Goodbye
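One thing to be aware of: the version of main above reads args(0) directly, so running it with no extra argument would fail with an ArrayIndexOutOfBoundsException. Here is a minimal sketch of one way to guard against that. The object name HelloArgs, the helper messages, and the fallback text are made up for this example, and the package statement is left out to keep it short.

```scala
// HelloArgs is a hypothetical variant of Hello for illustration only.
object HelloArgs {

  // Build the lines to print. If no argument was supplied, use a
  // fallback message instead of failing on args(0).
  def messages(args: Array[String]): List[String] = {
    val extra = args.headOption.getOrElse("(no message supplied)")
    List("Hello, world!", extra)
  }

  def main(args: Array[String]): Unit = messages(args).foreach(println)
}
```

Keeping the logic in a separate function that returns a value, with main only doing the printing, also makes it easy to check the behavior for different argument arrays.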

To see what happens when you produce a syntax error in your Scala code, go back to Hello.scala and change the first print statement in the main method so that it is missing the last quote:

println("Hello, world!)

Now go back to SBT and compile again to see the love letter you get from the Scala compiler.

[info] Compiling 1 Scala source to /Users/jbaldrid/devel/scalabha/target/classes...
[error] /Users/jbaldrid/devel/scalabha/src/main/scala/opennlp/bcomposes/Hello.scala:5: unclosed string literal
[error]     println("Hello, world!)
[error]             ^
[error] /Users/jbaldrid/devel/scalabha/src/main/scala/opennlp/bcomposes/Hello.scala:7: ')' expected but '}' found.
[error]   }
[error]   ^
[error] two errors found
[error] {file:/Users/jbaldrid/devel/scalabha/}default-86efd0/compile:compile: Compilation failed
[error] Total time: 0 s, completed Oct 26, 2011 11:02:07 AM

The compile attempt failed, and you must go back and fix it. But don’t do that yet. There’s a handy aspect of SBT in this write-save-compile loop that saves you time and effort: SBT allows triggered execution of actions, which means that SBT can automatically perform an action if there is a change to the stuff it cares about. The compile action cares about the source code, so it can monitor changes in the file system and automatically recompile any time a file is saved. To do this, you simply add ~ in front of the action.

Before fixing the error, type ~compile into SBT. You’ll see the same error message as before, but don’t worry about that. The last line of output from SBT will say:

1. Waiting for source changes... (press enter to interrupt)

Now go to Hello.scala again, add the quote back in, and save the file. This triggers the compile action in SBT, so you’ll see it automatically compile, with a success message.

[info] Compiling 1 Scala source to /Users/jbaldrid/devel/scalabha/target/classes...
[success] Total time: 0 s, completed Oct 26, 2011 11:02:49 AM
2. Waiting for source changes... (press enter to interrupt)

This is a nice way to see if your code is compiling as you work on it, with very little effort. Every time you save the file, it will let you know if there are problems. And, you’ll also be able to use the scalabha run target and know that you are using the latest compiled version when you do so.

As you develop your code in this way, you can invoke the “doc” action in SBT, then reload the index.html page in your browser, and it will show you the updated documentation for the things you’ve created. Try it now and look at the opennlp.bcomposes package that you’ve now created.

Creating code that uses existing packages

Now we can come back to using code from existing packages. In the past (if you’ve gone through all of these tutorials), you’ve seen statements like import scala.io.Source. That came from the standard Scala library, so it is always available to any Scala program. However, you can also use classes developed by others in a similar manner, provided your CLASSPATH is set up such that they are available. That is exactly what SBT does for you: all of the classes that are defined in the src/main/scala sub-directories are ready for your use.

As an example, save the following code as src/main/scala/opennlp/bcomposes/TreeTest.scala. It constructs a standard phrase structure tree for the sentence “I like coffee.”

package opennlp.bcomposes

import opennlp.scalabha.model.{Node,Value}

object TreeTest {

  def main (args: Array[String]) {
    val leaf1 = Value("I")
    val leaf2 = Value("like")
    val leaf3 = Value("coffee")
    val subjNpNode = Node("NP", List(leaf1))
    val verbNode = Node("V", List(leaf2))
    val objNpNode = Node("NP", List(leaf3))
    val vpNode = Node("VP", List(verbNode, objNpNode))
    val sentenceNode = Node("S", List(subjNpNode, vpNode))

    println("Printing the full tree:\n" + sentenceNode)
    println("\nPrinting the children of the VP node:\n" + vpNode.children)

    println("\nPrinting the yield of the full tree:\n" + sentenceNode.getTokens.mkString(" "))
    println("\nPrinting the yield of the VP node:\n" + vpNode.getTokens.mkString(" "))
  }

}

There are a few things to note here. The import statement at the top is what tells Scala the fully qualified package names for the classes Node and Value. You could have equivalently written it less concisely as follows.

import opennlp.scalabha.model.Node
import opennlp.scalabha.model.Value

Or, you could have left out the import statement and written the fully qualified names everywhere, e.g.:

val leaf1 = opennlp.scalabha.model.Value("I")

Second, Node and Value are case classes. We’ll discuss this more later, but for now, all you need to know is that to create an object of the Node or Value classes, it isn’t necessary to use the “new” keyword.

Third, the print statements are using the Scalabha API (Application Programming Interface) to do useful things with the objects, such as printing out the tree they describe, printing the yield of the nodes (the words that they cover), and so on. The scaladoc you looked at before for Scalabha shows you these functions, so go have a look if you haven’t already.
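To make the point about case classes a bit more concrete, here is a small self-contained sketch. The classes Leaf and Branch are hypothetical stand-ins defined just for this example; they are not Scalabha's Node and Value, and the words function is only loosely analogous to what a yield function might do. The sketch shows that case class objects are created without “new” and that they print in the same readable form you will see in the TreeTest output.

```scala
// Hypothetical stand-ins for illustration; NOT Scalabha's Node and Value.
sealed trait Tree
case class Leaf(word: String) extends Tree
case class Branch(label: String, children: List[Tree]) extends Tree

object CaseClassDemo {

  // Collect the words covered by a tree, from left to right.
  def words(tree: Tree): List[String] = tree match {
    case Leaf(word) => List(word)
    case Branch(_, children) => children.flatMap(words)
  }

  def main(args: Array[String]): Unit = {
    // No "new" keyword is needed to create case class objects.
    val verb = Branch("V", List(Leaf("like")))
    val obj = Branch("NP", List(Leaf("coffee")))
    val vp = Branch("VP", List(verb, obj))

    println(vp)                       // shows the nested structure
    println(words(vp).mkString(" "))  // prints: like coffee
  }
}
```

Case classes also compare by their contents, so two separately created Leaf("like") objects are equal to each other.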

Note that if you had left the triggered compilation on, SBT will have automatically compiled the TreeTest.scala. Otherwise, make sure to compile it yourself. Then, run it.

$ scalabha run opennlp.bcomposes.TreeTest
Printing the full tree:
Node(S,List(Node(NP,List(Value(I))), Node(VP,List(Node(V,List(Value(like))), Node(NP,List(Value(coffee)))))))

Printing the children of the VP node:
List(Node(V,List(Value(like))), Node(NP,List(Value(coffee))))

Printing the yield of the full tree:
I like coffee

Printing the yield of the VP node:
like coffee

Make and use your own package

By importing the classes you need in this manner, you can get more done by using them as you need. Any class in Scalabha or in the libraries that are included with it will be available for you, including any classes you define. As an example, do the following.

$ cd $SCALABHA_DIR/src/main/scala/opennlp/bcomposes
$ mkdir person
$ mkdir music

Now save the Person class from the previous tutorial as Person.scala in the person directory. Here’s the code again (note the addition of the package statement).

package opennlp.bcomposes.person

class Person (
  val firstName: String,
  val lastName: String,
  val age: Int,
  val occupation: String
) {

  def fullName: String = firstName + " " + lastName

  def greet (formal: Boolean): String = {
    if (formal)
      "Hello, my name is " + fullName + ". I'm a " + occupation + "."
    else
      "Hi, I'm " + firstName + "!"
  }

}

Now save the following as RadioheadGreeting.scala in the music directory.

package opennlp.bcomposes.music

import opennlp.bcomposes.person.Person

object RadioheadGreeting {

  def main (args: Array[String]) {
    val thomYorke = new Person("Thom", "Yorke", 43, "musician")
    val johnnyGreenwood = new Person("Johnny", "Greenwood", 39, "musician")
    val colinGreenwood = new Person("Colin", "Greenwood", 41, "musician")
    val edObrien = new Person("Ed", "O'Brien", 42, "musician")
    val philSelway = new Person("Phil", "Selway", 44, "musician")
    val radiohead = List(thomYorke, johnnyGreenwood, colinGreenwood, edObrien, philSelway)
    radiohead.foreach(bandmember => println(bandmember.greet(false)))
  }

}

When we did the compilation tutorial previously, Person.scala and RadioheadGreeting.scala were in the same directory, which allowed the latter to know about the Person class. Now that they are in separate packages, the Person class must be explicitly imported; once you’ve done so, you can code with Person objects just as you did before.

Finally, to run it, we now must specify the fully qualified package name for RadioheadGreeting.

$ scalabha run opennlp.bcomposes.music.RadioheadGreeting
Hi, I'm Thom!
Hi, I'm Johnny!
Hi, I'm Colin!
Hi, I'm Ed!
Hi, I'm Phil!

A note on package names and their relation to directories

Package names are made unique by certain conventions that generally ensure you won’t get clashes. For example, we are using opennlp.scalabha and opennlp.bcomposes, which I happen to know are unique. Quite often these names will include full internet domains, in reverse, like org.apache.commons and com.cloudera.crunch. By convention, we put the source files that are in packages (and subpackages) in directory structures that reflect the names. So, for example, opennlp.bcomposes.music.RadioheadGreeting is in the directory src/main/scala/opennlp/bcomposes/music. However, it is worth noting that this is not a hard constraint with Scala (as it is with Java).

There is a great deal more to using a build system, but this is where I must end this discussion, hoping it is enough to get the core concepts across and make it possible for my students to do the homework on part-of-speech tagging and making use of the opennlp.scalabha.postag package!

Copyright 2011 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

 

Topics: scripting, compiling, main methods, return values of functions

Preface

This is part 10 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

The tutorials up to this point have been based on working with the Scala REPL or running basic scripts from the command line. The latter is called “scripting” and usually is done for fairly simple, self-contained coding tasks. For more involved tasks that require a number of different modules and access to libraries produced by others, it is necessary to work with a build system that brings together your code and others’ code and allows you to compile it, test it, and package it so that you can use it as an application.

This tutorial takes you from running Scala scripts to compiling Scala programs to create byte code that can be shared by different applications. This will act as a bridge to set you up for the next step of using a build system. Along the way, some points will be made about objects, extending on some of the ideas from the previous tutorial about object-oriented programming. At a high level, the relevance of objects to a larger, modularized code base should be pretty clear: objects encapsulate data and functions that can be used by other objects, and we need to be able to organize them so that objects know how to find other objects and class definitions. Build systems, which we’ll look at in the next tutorial, will make this straightforward.

Running Scala scripts

In the beginning, you started with the REPL.

scala> println("Hello, World!")
Hello, World!

Of course, the REPL is just a (very useful) playground for trying out snippets of Scala code, not for doing real work. So, you saw that you could put code like println("Hello, World!") into a file called Hello.scala and run it from the command line.

$ scala Hello.scala
Hello, World!

The homeworks and tutorials done so far have worked in this way, though they are a bit more complex. We can even include class definitions and objects created from a class. For example, using the Person class from the previous tutorial, we can put all the code into a file called People.scala (btw, this name doesn’t matter — could as well be Blurglecruncheon.scala).

class Person (
  val firstName: String,
  val lastName: String,
  val age: Int,
  val occupation: String
) {

  def fullName: String = firstName + " " + lastName

  def greet (formal: Boolean): String = {
    if (formal)
      "Hello, my name is " + fullName + ". I'm a " + occupation + "."
    else
      "Hi, I'm " + firstName + "!"
  }

}

val johnSmith = new Person("John", "Smith", 37, "linguist")
val janeDoe = new Person("Jane", "Doe", 34, "computer scientist")
val johnDoe = new Person("John", "Doe", 43, "philosopher")
val johnBrown = new Person("John", "Brown", 28, "mathematician")

val people = List(johnSmith, janeDoe, johnDoe, johnBrown)
people.foreach(person => println(person.greet(true)))

This can now be run from the command line, producing the expected result.

$ scala People.scala
Hello, my name is John Smith. I'm a linguist.
Hello, my name is Jane Doe. I'm a computer scientist.
Hello, my name is John Doe. I'm a philosopher.
Hello, my name is John Brown. I'm a mathematician.

However, suppose you wanted to use the Person class from a different application (e.g. that is defined in a different file). You might think you could save the following in the file Radiohead.scala, and then run it with Scala.

val thomYorke = new Person("Thom", "Yorke", 43, "musician")
val johnnyGreenwood = new Person("Johnny", "Greenwood", 39, "musician")
val colinGreenwood = new Person("Colin", "Greenwood", 41, "musician")
val edObrien = new Person("Ed", "O'Brien", 42, "musician")
val philSelway = new Person("Phil", "Selway", 44, "musician")
val radiohead = List(thomYorke, johnnyGreenwood, colinGreenwood, edObrien, philSelway)
radiohead.foreach(bandmember => println(bandmember.greet(false)))

However, if you do “scala Radiohead.scala” you’ll see five errors, each one complaining that the type Person wasn’t found. How could Radiohead.scala know about the Person class and where to find its definition? I’m not aware of a way to do this with scripting-style Scala programming, and even though I suspect there may be a way to do something this simple, I don’t even care to know it. Let’s just get straight to compiling.

Compiling

The usual thing we do with Scala is to compile our programs to byte code. We won’t go into the details of that, but it basically means that Scala turns the text of a Scala program into a compiled set of instructions that can be run by the Java Virtual Machine (JVM) rather than directly by your operating system. (Because it compiles to Java byte code, it is pretty straightforward to use Java code when coding in Scala.)

So, what does compilation look like? We need to start by changing the code we did above a bit. Make a directory that has nothing in it, say /tmp/tutorial. Then save the following as PersonApp.scala in that directory.

class Person (
  val firstName: String,
  val lastName: String,
  val age: Int,
  val occupation: String
) {

  def fullName: String = firstName + " " + lastName

  def greet (formal: Boolean): String = {
    if (formal)
      "Hello, my name is " + fullName + ". I'm a " + occupation + "."
    else
      "Hi, I'm " + firstName + "!"
  }

}

object PersonApp {

  def main (args: Array[String]) {
    val johnSmith = new Person("John", "Smith", 37, "linguist")
    val janeDoe = new Person("Jane", "Doe", 34, "computer scientist")
    val johnDoe = new Person("John", "Doe", 43, "philosopher")
    val johnBrown = new Person("John", "Brown", 28, "mathematician")

    val people = List(johnSmith, janeDoe, johnDoe, johnBrown)
    people.foreach(person => println(person.greet(true)))
  }

}

Notice that the code looks pretty similar to the script above, but now we have a PersonApp object with a main method. The main method contains all the stuff that the original script had after the Person definition. Notice also that there is an args argument to the main method, which should look familiar now. What you are seeing is that a Scala script is basically just a simplified view of an object with a main method. Such scripts use the convention that the Array[String] provided to the method is called args.

Okay, so now consider what happens if you run “scala PersonApp.scala” — nothing at all. That’s because there is no executable code available outside of the object and class definitions. Instead, we need to compile the code and then run the main method of specific objects. The next step is to run scalac (N.B. “scalac” with a “c”, not “scala”) on PersonApp.scala. The name scalac is short for Scala compiler. Do the following steps in the /tmp/tutorial directory.

$ scalac PersonApp.scala
$ ls
Person.class                    PersonApp.class
PersonApp$$anonfun$main$1.class PersonApp.scala
PersonApp$.class

Notice that a number of *.class files have been generated. These are byte code files that the scala application is able to run. A nice thing here is that all the compilation is already done: when in the past you ran “scala” on your programs (scripts), it had to first compile the instructions and then run the program. Now we are separating these steps into a compilation phase and a running phase.

Having generated the class files, we can run any object that has a main method, like PersonApp.

$ scala PersonApp
Hello, my name is John Smith. I'm a linguist.
Hello, my name is Jane Doe. I'm a computer scientist.
Hello, my name is John Doe. I'm a philosopher.
Hello, my name is John Brown. I'm a mathematician.

Try running “scala Person” to see the error message it gives you.

Next, move the Radiohead.scala script that you saved earlier into this directory and run it.

$ scala Radiohead.scala
Hi, I'm Thom!
Hi, I'm Johnny!
Hi, I'm Colin!
Hi, I'm Ed!
Hi, I'm Phil!

This is the same script, but now it is in a directory that contains the Person.class file, which tells Scala everything that Radiohead.scala needs to construct objects of the Person class. Scala makes available any class file that is defined in the CLASSPATH, an environment variable that by default includes the current working directory.

Despite this success, we’re going away from script land with this post, so change the contents of Radiohead.scala to be the following.

object RadioheadGreeting {

  def main (args: Array[String]) {
    val thomYorke = new Person("Thom", "Yorke", 43, "musician")
    val johnnyGreenwood = new Person("Johnny", "Greenwood", 39, "musician")
    val colinGreenwood = new Person("Colin", "Greenwood", 41, "musician")
    val edObrien = new Person("Ed", "O'Brien", 42, "musician")
    val philSelway = new Person("Phil", "Selway", 44, "musician")
    val radiohead = List(thomYorke, johnnyGreenwood, colinGreenwood, edObrien, philSelway)
    radiohead.foreach(bandmember => println(bandmember.greet(false)))
  }

}

Then run scalac on all of the *.scala files in the directory. There are now more class files, corresponding to the RadioheadGreeting object we defined.

$ scalac *.scala
$ ls
Person.class                            Radiohead.scala
PersonApp$$anonfun$main$1.class         RadioheadGreeting$$anonfun$main$1.class
PersonApp$.class                        RadioheadGreeting$.class
PersonApp.class                         RadioheadGreeting.class
PersonApp.scala

You can now run “scala RadioheadGreeting” to get the greeting from the band members. Notice that the file in which RadioheadGreeting was saved is called Radiohead.scala, and that no class files called Radiohead.class, etc. were generated. Again, the file could have been named something entirely different, like Turlingdrome.scala. (Embrace your inner Vogon.)

Multiple objects in the same file

There is no problem having multiple objects with main methods in the same file. When you compile the file with scalac, each object generates its own set of class files, and you call scala with the name of whichever object contains the main method you want to run. As an example, save the following as Greetings.scala.

object Hello {
  def main (args: Array[String]) {
    println("Hello, world!")
  }
}

object Goodbye {
  def main (args: Array[String]) {
    println("Goodbye, world!")
  }
}

object SayIt {
  def main (args: Array[String]) {
    args.foreach(println)
  }
}

Next compile the file and then you can run any of the generated class files (since they all have main methods).

$ scalac Greetings.scala
$ scala Hello
Hello, world!
$ scala Goodbye
Goodbye, world!
$ scala Goodbye many useless arguments
Goodbye, world!
$ scala SayIt "Oh freddled gruntbuggly" "thy micturations are to me" "As plurdled gabbleblotchits on a lurgid bee."
Oh freddled gruntbuggly
thy micturations are to me
As plurdled gabbleblotchits on a lurgid bee.

In case you missed it earlier, the args array is where the command line arguments go and you can thus make use of them (or not, as in the case of the Hello and Goodbye objects).
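As a small sketch of putting the arguments to work, the following helper treats every argument as an integer and adds them up. The names ArgTools and sumArgs are made up for this illustration, and the code assumes that every argument really is a valid integer.

```scala
object ArgTools {

  // Convert each argument to an Int and add them all up.
  // Assumes every argument is a valid integer; toInt throws an exception otherwise.
  def sumArgs (args: Array[String]): Int = args.map(arg => arg.toInt).sum

}
```

Calling println(ArgTools.sumArgs(args)) inside a main method and running that object with the arguments 1 2 3 would print 6.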

Functions with return values versus those without

Some functions return a value while others do not. As a simple example, consider the following pair of functions.

scala> def plusOne (x: Int) = x+1
plusOne: (x: Int)Int

scala> def printPlusOne (x: Int) = println(x+1)
printPlusOne: (x: Int)Unit

The first takes an Int argument and returns an Int, which is a value. The other takes an Int and returns Unit, which is to say it doesn’t return a value. Notice the difference in behavior between the two following uses of the functions.

scala> val foo = plusOne(2)
foo: Int = 3

scala> val bar = printPlusOne(2)
3
bar: Unit = ()

Scala uses a slightly subtle distinction in function definitions that can distinguish functions that return values versus those that return Unit (no value): If you don’t use an equals sign in the definition, it means that the function returns Unit.

scala> def plusOneNoEquals (x: Int) { x+1 }
plusOneNoEquals: (x: Int)Unit

scala> def printPlusOneNoEquals (x: Int) { println(x+1) }
printPlusOneNoEquals: (x: Int)Unit

Notice that the above definition of plusOneNoEquals returns Unit, even though it looks almost identical to plusOne defined earlier. Check it out.

scala> val foo = plusOneNoEquals(2)
foo: Unit = ()
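One way to guard against this slip is to write the return type out explicitly. The sketch below (the object and function names are made up for illustration) declares the result type of each function, so that anyone reading the definition can see right away whether a value comes back.

```scala
object ReturnTypes {

  // The declared result type Int says that a value is returned.
  def plusOneTyped (x: Int): Int = x + 1

  // The declared result type Unit says that only printing happens.
  def printPlusOneTyped (x: Int): Unit = println(x + 1)

}
```

With the types written out, accidentally dropping the equals sign from plusOneTyped no longer silently produces a function that returns Unit.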

Now look back at the main methods given earlier. No equals. Yep, they don’t have a return value. They are the entry point into your code, and any effects of running the code must be output to the console (e.g. with println or via a GUI) or written to the file system (or the internet somewhere). The outputs of such functions (ones which do not return a value) are called side-effects. You need them for the main methods. However, in many styles of programming, a great deal of work is done with side-effects. I’ve been trying to gently lead the readers of this tutorial to adopt a more functional approach that tries to avoid them. I’ve found it a more effective style myself in my own coding, so I’m hoping it will serve you all better to start from that point. (Note that Scala supports many styles of programming, which is nice because you have choice and can go with what you find most suitable.)
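To make the contrast concrete, here is a small sketch (the names are invented for this example) that does the same computation in two ways: one function prints as it goes, while the other returns a new list and leaves any printing to whoever called it.

```scala
object SideEffects {

  // Side-effecting style: returns Unit, so the only result is console output.
  def printDoubled (numbers: List[Int]): Unit =
    numbers.foreach(n => println(n * 2))

  // Functional style: returns a new list and prints nothing.
  def doubled (numbers: List[Int]): List[Int] = numbers.map(n => n * 2)

}
```

Because doubled returns a value, its result can be used further: SideEffects.doubled(List(1, 2, 3)).sum gives 12, whereas nothing more can be done with what printDoubled sends to the console.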

Cleaning up

You may have noticed that the directory you are working in as you run scalac on your scala files becomes quite littered with class files. For example, here’s what the state of the code directory worked with in this tutorial looks like after compiling all files.

$ ls
Goodbye$.class                          PersonApp.scala
Goodbye.class                           Radiohead.scala
Greetings.scala                         RadioheadGreeting$$anonfun$main$1.class
Hello$.class                            RadioheadGreeting$.class
Hello.class                             RadioheadGreeting.class
Person.class                            SayIt$$anonfun$main$1.class
PersonApp$$anonfun$main$1.class         SayIt$.class
PersonApp$.class                        SayIt.class
PersonApp.class

A mess, right? Generally, one would rarely develop a Scala application by compiling it directly in this way. Instead a build system is used to manage the compilation process, organize the files, and allow one to easily access software libraries created by other developers. The next tutorial will cover this, using SBT (the Simple Build Tool).

Copyright 2011 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

Topics: objects, classes, inheritance, traits, Lists with multiple related types, apply

Preface

This is part 9 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

This tutorial is about object-oriented programming with Scala. Most of what we’ve seen so far has been programming with functions and using basic types, like Int, Double, and String, and with predefined types like List and Map. As it turns out, these are all classes, or types of Scala data structures that allow one to create objects, or instances of the type. This tutorial will not give a broad introduction to object-oriented programming, but it will give some practical examples of classes and objects and how to use them. I apologize in advance for some sloppiness in the presentation of object-oriented concepts; the intent is to get across the ideas for beginners mainly through intuitive examples without being mired in lots of technical details. See the Wikipedia page on object-oriented programming for more detail.

Note that the definitions of objects and classes in this tutorial are most easily viewed as plain text, out of the REPL. So, I’ll put a piece of code into the text, and you should add it to your own REPL (by simply cutting and pasting) in order to be able to follow along.

Objects

At its core, an object can be thought of as a structure that encapsulates some data and functions. Let’s start with an example of an object representing a person and some of their possible attributes.

object JohnSmith {
  val firstName = "John"
  val lastName = "Smith"
  val age = 37
  val occupation = "linguist"

  def fullName: String = firstName + " " + lastName

  def greet (formal: Boolean): String = {
    if (formal)
      "Hello, my name is " + fullName + ". I'm a " + occupation + "."
    else
      "Hi, I'm " + firstName + "!"
  }

}

If you put this into the Scala REPL, you’ll be able to access the fields (firstName, lastName, age, and occupation) and the functions (fullName and greet).

scala> JohnSmith.firstName
res0: java.lang.String = John

scala> JohnSmith.fullName
res1: String = John Smith

scala> JohnSmith.greet(true)
res2: String = Hello, my name is John Smith. I'm a linguist.

scala> JohnSmith.greet(false)
res3: String = Hi, I'm John!

So, at its most basic level, an object is just that: a collection of values and functions (also often called methods). You can access any of those values or functions by giving the name of the object followed by a period followed by the value or function you want to use. This can be useful for organizing such collections, but it also leads to many more possibilities, as we’ll see.

We might of course be interested in having the information about another person encapsulated in this way. We could do this by mimicking the definition for John Smith.

object JaneDoe {
  val firstName = "Jane"
  val lastName = "Doe"
  val age = 34
  val occupation = "computer scientist"

  def fullName: String = firstName + " " + lastName

  def greet (formal: Boolean): String = {
    if (formal)
      "Hello, my name is " + fullName + ". I'm a " + occupation + "."
    else
      "Hi, I'm " + firstName + "!"
  }

}

After adding the above code to the REPL, now Jane Doe can greet us.

scala> JaneDoe.greet(true)
res4: String = Hello, my name is Jane Doe. I'm a computer scientist.

scala> JaneDoe.greet(false)
res5: String = Hi, I'm Jane!

Of course, I created the JaneDoe object by doing a copy-and-paste and then replacing the fields with Jane Doe’s information. This leads to a lot of wasted effort: the fields are the same, but the values are different, and the functions are completely identical. If you want to change something about the way greetings are made, you’d have to update it across all of the objects.

More importantly, these two objects are completely distinct from one another: one cannot put them in a list and map a function over that list. Consider the following failed attempt.

scala> val people = List(JohnSmith, JaneDoe)
people: List[ScalaObject] = List(JohnSmith$@698fcb66, JaneDoe$@5f72cbae)

scala> people.map(person => person.firstName)
<console>:11: error: value firstName is not a member of ScalaObject
people.map(person => person.firstName)
                                          ^

The only thing that Scala knows about JohnSmith and JaneDoe is that they are ScalaObjects. That means that a list of such objects can basically just contain them and allow you to move them around as a group. So, something more is needed to make these collections more useful and more general.

Classes

With the list above, what we’d like to have is a List[Person], where Person is a type that has known fields and functions. We can accomplish this by defining a Person class and then defining John and Jane as members of that class. This also reduces the cut-and-paste duplication problem noted earlier. Here’s what it looks like.

class Person (
  val firstName: String,
  val lastName: String,
  val age: Int,
  val occupation: String
) {

  def fullName: String = firstName + " " + lastName

  def greet (formal: Boolean): String = {
    if (formal)
      "Hello, my name is " + fullName + ". I'm a " + occupation + "."
    else
      "Hi, I'm " + firstName + "!"
  }

}

The class keyword indicates that this is a class definition and Person is the name of the class. The next part of the definition is a set of parameters to the class that allow us to construct objects that are instances of the class — in other words, they are placeholders that allow us to use the Person class as a factory for creating Person objects. We do this by using the new keyword, giving the name of the class and supplying the values for each of the parameters. For example, here’s how we can create John Smith now.

scala> val johnSmith = new Person("John", "Smith", 37, "linguist")
johnSmith: Person = Person@1979d4fb

Just as we could with the one-off standalone JohnSmith object previously, we can now access the fields and functions.

scala> johnSmith.age
res8: Int = 37

scala> johnSmith.greet(true)
res9: String = Hello, my name is John Smith. I'm a linguist.

Defining other people is now easy, and doesn’t require any cutting-and-pasting.

scala> val janeDoe = new Person("Jane", "Doe", 34, "computer scientist")
janeDoe: Person = Person@7ff5376c

scala> val johnDoe = new Person("John", "Doe", 43, "philosopher")
johnDoe: Person = Person@6544c984

scala> val johnBrown = new Person("John", "Brown", 28, "mathematician")
johnBrown: Person = Person@4076a247

These Person objects can now be put into a list together, giving us a List[Person] that allows mapping to retrieve specific values, like first names and ages, and performing computations like calculating the average age of the individuals in the list.

scala> val people = List(johnSmith, janeDoe, johnDoe, johnBrown)
people: List[Person] = List(Person@1979d4fb, Person@7ff5376c, Person@6544c984, Person@4076a247)

scala> people.map(person => person.firstName)
res10: List[String] = List(John, Jane, John, John)

scala> people.map(person => person.age)
res11: List[Int] = List(37, 34, 43, 28)

scala> people.map(person => person.age).sum/people.length.toDouble
res12: Double = 35.5

We can sort them according to age.

scala> val ageSortedPeople = people.sortBy(_.age)
ageSortedPeople: List[Person] = List(Person@4076a247, Person@7ff5376c, Person@1979d4fb, Person@6544c984)

scala> ageSortedPeople.map(person => person.fullName + ":" + person.age)
res13: List[java.lang.String] = List(John Brown:28, Jane Doe:34, John Smith:37, John Doe:43)

We can also group people by first name, last name, etc.

scala> people.groupBy(person => person.firstName)
res14: scala.collection.immutable.Map[String,List[Person]] = Map(Jane -> List(Person@7ff5376c), John -> List(Person@1979d4fb, Person@6544c984, Person@4076a247))

scala> people.groupBy(person => person.lastName)
res15: scala.collection.immutable.Map[String,List[Person]] = Map(Brown -> List(Person@4076a247), Smith -> List(Person@1979d4fb), Doe -> List(Person@7ff5376c, Person@6544c984))

With this, we can have all the Johns greet us.

scala> people.groupBy(person => person.firstName)("John").foreach(john => println(john.greet(true)))
Hello, my name is John Smith. I'm a linguist.
Hello, my name is John Doe. I'm a philosopher.
Hello, my name is John Brown. I'm a mathematician.
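Grouping and mapping can also be combined. As a rough sketch, the following computes the average age for each first name. It is wrapped in an object called GroupingSketch (a name made up here) and uses a trimmed-down copy of Person so that it runs on its own without disturbing the definitions already in your REPL.

```scala
object GroupingSketch {

  // A trimmed-down copy of Person so that this sketch is self-contained.
  class Person (val firstName: String, val lastName: String, val age: Int)

  val people = List(
    new Person("John", "Smith", 37),
    new Person("Jane", "Doe", 34),
    new Person("John", "Doe", 43),
    new Person("John", "Brown", 28)
  )

  // Group by first name, then replace each group with the average age of its members.
  val averageAgeByFirstName: Map[String, Double] =
    people.groupBy(person => person.firstName).map {
      case (name, group) => (name, group.map(_.age).sum / group.length.toDouble)
    }

}
```

Here GroupingSketch.averageAgeByFirstName("John") is 36.0 (the average of 37, 43, and 28) and the value for "Jane" is 34.0.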

Standalone objects

Above, we saw how to create instances of the Person class by using the new keyword and assigning the resulting object to a variable. We can come back full circle to the first JohnSmith object we created, which was a standalone ScalaObject. We can instead create such a standalone object by extending the Person class.

scala> object ThomYorke extends Person("Thom", "Yorke", 43, "musician")
defined module ThomYorke

scala> ThomYorke.greet(true)
res25: String = Hello, my name is Thom Yorke. I'm a musician.

By extending the Person class to create the object, we are saying that the object is a kind of Person — see more on inheritance below. So, ThomYorke is a Person object, like the others we created, but it is for a different use case that we’ll see more of in the next tutorial. For now, I’ll summarize, very roughly, by saying that the ThomYorke object can be made more accessible by other code that might be using my code, while the johnSmith and janeDoe objects are going to be more locally contained.
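One consequence worth seeing in code: because such a standalone object is a Person, it can sit in a List[Person] right next to objects created with new. The sketch below uses a trimmed-down copy of Person inside an object called StandaloneSketch (both are simplifications made up for this example) so that it runs on its own.

```scala
object StandaloneSketch {

  // A trimmed-down copy of Person so that this sketch is self-contained.
  class Person (val firstName: String, val lastName: String) {
    def fullName: String = firstName + " " + lastName
  }

  // A standalone object that is a Person.
  object ThomYorke extends Person("Thom", "Yorke")

  // An ordinary instance created with new.
  val janeDoe = new Person("Jane", "Doe")

  // Both are Persons, so they fit in the same List[Person].
  val people: List[Person] = List(ThomYorke, janeDoe)
  val names: List[String] = people.map(person => person.fullName)

}
```

Here StandaloneSketch.names is List(Thom Yorke, Jane Doe).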

Inheritance

The standalone objects lead us naturally to the idea of inheritance. In many domains, there are natural hierarchies of types, such that properties of a super type are inherited by its subtypes (e.g. fish have gills and swim, so salmon have gills and swim). For example, we could have a Linguist type that is a kind of Person, a ComputerScientist type that is a kind of Person, and so on. To model this, we create one class that extends another and possibly provides some additional parameters, such as the following definition of a Linguist sub-type of Person.

class Linguist (
  firstName: String,
  lastName: String,
  age: Int,
  val speciality: String,
  val favoriteLanguage: String
) extends Person(firstName, lastName, age, "linguist") {

  def workGreeting =
    "As a " + occupation + ", I am a " + speciality + " who likes to study the language " + favoriteLanguage + "."

}

The Linguist class has its own parameter list: some of these, like firstName, lastName, and age, are passed on to Person, and there are new parameter fields speciality and favoriteLanguage. The extends portion of the definition passes on the relevant parameters needed to construct all the information to make a Person, and for a Linguist, it directly sets the occupation parameter to be “linguist” — thus, we don’t need to provide that when we construct a Linguist, such as Noam Chomsky.

scala> val noamChomsky = new Linguist("Noam", "Chomsky", 83, "syntactician", "English")
noamChomsky: Linguist = Linguist@54c0627f

Having defined a Linguist object in this way, we can ask it to give its work greeting.

scala> noamChomsky.workGreeting
res26: java.lang.String = As a linguist, I am a syntactician who likes to study the language English.

We can also access fields and functions of Person objects, like age and greet.

scala> noamChomsky.age
res27: Int = 83

scala> noamChomsky.greet(true)
res28: String = Hello, my name is Noam Chomsky. I'm a linguist.

Of course, the Linguist-specific fields like favoriteLanguage are accessible too.

scala> noamChomsky.favoriteLanguage
res29: String = English
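Inheritance also means that a Linguist can be used anywhere a Person is expected. The following sketch illustrates this with simplified copies of the two classes and a made-up function called introduce, all wrapped in an object (InheritanceSketch, also made up) so that it runs on its own.

```scala
object InheritanceSketch {

  // Simplified copies of Person and Linguist so that this sketch is self-contained.
  class Person (val firstName: String, val lastName: String, val occupation: String) {
    def fullName: String = firstName + " " + lastName
  }

  class Linguist (firstName: String, lastName: String, val favoriteLanguage: String)
    extends Person(firstName, lastName, "linguist")

  // This function only asks for a Person...
  def introduce (person: Person): String =
    person.fullName + " is a " + person.occupation + "."

  // ...but a Linguist is a kind of Person, so it is accepted too.
  val noam = new Linguist("Noam", "Chomsky", "English")
  val introduction: String = introduce(noam)

}
```

The value of InheritanceSketch.introduction is "Noam Chomsky is a linguist."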

The observant reader will have noticed that some of the parameters are prefaced with val and others are not. We’ll get back to that distinction a bit later.

Traits

We could of course now go on to define a ComputerScientist class that would also have a workGreeting function, but Linguist.workGreeting and ComputerScientist.workGreeting would be entirely separate: nothing would tell Scala that the two classes have that function in common. To connect them, we can use traits, which are like classes, but which define an interface of functions and fields that classes can supply concrete values and implementations for. (Note: traits can also define concrete fields and functions, so they aren’t limited to placeholder functions as we show below.)

As an example, here’s a Worker trait, which simply defines a function workGreeting and declares that it must return a String.

trait Worker {
  def workGreeting: String
}

The Linguist class defined earlier already provides an implementation of that function. To allow a Linguist to be considered as a type of Worker, we add with Worker after extending Person.

class Linguist (
  firstName: String,
  lastName: String,
  age: Int,
  val speciality: String,
  val favoriteLanguage: String
) extends Person(firstName, lastName, age, "linguist") with Worker {

  def workGreeting =
    "As a " + occupation + ", I am a " + speciality + " who likes to study the language " + favoriteLanguage + "."

}

This is called “mixing in” the trait Worker, because the Linguist class mixes in the fields and functions of Worker with those of Person.

Note that we can also create classes that simply extend a trait like Worker.

class Student (school: String, subject: String) extends Worker {
  def workGreeting = "I'm studying " + subject + " at " + school + "!"
}

We can now create a Student object and request their greeting.

scala> val anonymousStudent = new Student("The University of Texas at Austin", "history")
anonymousStudent: Student = Student@734445b5

scala> anonymousStudent.workGreeting
res32: java.lang.String = I'm studying history at The University of Texas at Austin!
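Returning briefly to the earlier note that traits can define concrete functions too, here is a minimal sketch of what that might look like. The loudWorkGreeting function and the Baker class are hypothetical additions made up for this example, and everything is wrapped in an object called TraitSketch so that it doesn’t replace the Worker trait already in your REPL.

```scala
object TraitSketch {

  trait Worker {
    // Abstract: each class that mixes in Worker must supply this.
    def workGreeting: String

    // Concrete: every Worker gets this for free, built on top of workGreeting.
    def loudWorkGreeting: String = workGreeting.toUpperCase + "!!"
  }

  // Baker only has to define workGreeting.
  class Baker extends Worker {
    def workGreeting = "I bake bread"
  }

  val baker = new Baker

}
```

Here TraitSketch.baker.loudWorkGreeting gives "I BAKE BREAD!!" even though Baker never defined that function itself.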

Notice that the parameters school and subject were not preceded by val in the definition of Student. That means that they are not member fields of the Student class, which means that they cannot be accessed externally. For example, attempting to access the value provided for school for anonymousStudent fails.

scala> anonymousStudent.school
<console>:11: error: value school is not a member of Student
anonymousStudent.school

Of course, internally, Student can use the values provided to such parameters, for example in defining the result of workGreeting. This sort of encapsulation hides properties of the objects of a class from code that is outside the class; this strategy can help reduce the degrees of freedom available to users of your code so that they only use what you want them to. In general, if others don’t need to use it, you shouldn’t make it available to them.
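Here is a small sketch of that idea with a made-up Account class: one parameter is marked with val and so becomes a field that outside code can read, while the other is a plain parameter that can only be used inside the class.

```scala
object EncapsulationSketch {

  // name is a val parameter, so it becomes a field that outside code can read.
  // pin is a plain parameter, so it is only usable inside the class.
  class Account (val name: String, pin: Int) {
    def checkPin (attempt: Int): Boolean = attempt == pin
  }

  val account = new Account("Jane", 1234)

  val visibleName: String = account.name       // fine: name is a field
  val pinOk: Boolean = account.checkPin(1234)  // fine: goes through a function
  // account.pin                                // would not compile: pin is not a member

}
```

Outside code can ask whether a guess matches the pin, but it has no way to read the pin itself.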

Returning to classes that are both Persons and Workers, when we define a ComputerScientist, we do a similar extends … with declaration as we did for Linguist.

class ComputerScientist (
  firstName: String,
  lastName: String,
  age: Int,
  val speciality: String,
  favoriteProgrammingLanguage: String
) extends Person(firstName, lastName, age, "computer scientist") with Worker {

  def workGreeting =
    "As a " + occupation + ", I work on " + speciality + ". Much of my code is written in " + favoriteProgrammingLanguage + "."

}

Let’s create Andrew McCallum as a ComputerScientist object.

scala> val andrewMcCallum = new ComputerScientist("Andrew", "McCallum", 44, "machine learning", "Scala")
andrewMcCallum: ComputerScientist = ComputerScientist@493cd5ba

scala> andrewMcCallum.workGreeting
res31: java.lang.String = As a computer scientist, I work on machine learning. Much of my code is written in Scala.

Because we redefined Linguist to be a Worker, we need to recreate Noam Chomsky using the new definition. (The creation looks the same as before, but it uses the new class definition that has been updated in the REPL.)

scala> val noamChomsky = new Linguist("Noam", "Chomsky", 83, "syntactician", "English")
noamChomsky: Linguist = Linguist@6fccaf14

A minor thing to note: the speciality field of ComputerScientist is disconnected from that of Linguist, so there is no particular expectation of consistency of use across the two: for Linguist it is a description of a person working in a sub-area, but for ComputerScientist it is a description of a sub-area.

So, what happens if we put noamChomsky and andrewMcCallum in a List together?

scala> val professors = List(noamChomsky, andrewMcCallum)
professors: List[Person with Worker] = List(Linguist@6fccaf14, ComputerScientist@493cd5ba)

Scala has created a list with type List[Person with Worker]; this is the most specific type that is valid for all elements of the list. It means we can treat all of the elements as Persons, e.g. accessing their occupation (which is a member field of Person).

scala> professors.map(prof => prof.occupation)
res34: List[String] = List(linguist, computer scientist)

And we can treat each element of the list as a Person and a Worker, e.g. printing out their fullName (from Person) and their workGreeting (from Worker).

scala> professors.foreach(prof => println(prof.fullName + ": " + prof.workGreeting))
Noam Chomsky: As a linguist, I am a syntactician who likes to study the language English.
Andrew McCallum: As a computer scientist, I work on machine learning. Much of my code is written in Scala.

We cannot, however, access fields and functions that are specific to Linguists or ComputerScientists, such as favoriteLanguage from Linguist.

scala> professors.map(prof => prof.favoriteLanguage)
<console>:15: error: value favoriteLanguage is not a member of Person with Worker
professors.map(prof => prof.favoriteLanguage)

It is easy to see why Scala has this behavior: even though that would have been valid for noamChomsky, it would not be for andrewMcCallum (according to the way we defined Linguist and ComputerScientist).

Matching on types in polymorphic Lists

Consider what happens when the anonymousStudent is in a list with the professors.

scala> val workers = List(noamChomsky, andrewMcCallum, anonymousStudent)
workers: List[ScalaObject with Worker] = List(Linguist@6fccaf14, ComputerScientist@493cd5ba, Student@734445b5)

The Person type is gone, and we now have a list of a more general type ScalaObject with Worker. Now we can only use the workGreeting method from Worker.

However, it is worth pointing out that match statements come in handy when you have collections of heterogeneous objects. For example, put the following code into the REPL.

val people = List(johnSmith, noamChomsky, andrewMcCallum, anonymousStudent)

people.foreach { person =>
  person match {
    case x: Person with Worker => println(x.fullName + ": " + x.workGreeting)
    case x: Person => println(x.fullName + ": " + x.greet(true))
    case x: Worker => println("Anonymous:" + x.workGreeting)
  }
}

The result is the following (remember that johnSmith was never defined as a Linguist — he was defined as a Person whose occupation is “linguist”).

John Smith: Hello, my name is John Smith. I'm a linguist.
Noam Chomsky: As a linguist, I am a syntactician who likes to study the language English.
Andrew McCallum: As a computer scientist, I work on machine learning. Much of my code is written in Scala.
Anonymous:I'm studying history at The University of Texas at Austin!

So, we can switch our behavior by matching to more specific types using Scala’s pattern matching.
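Two details are worth spelling out: the cases are tried from top to bottom, so the most specific type should come first, and a match can return a value rather than print. The following sketch shows both, using simplified stand-ins for the classes above and a made-up function called describe, all wrapped in an object (MatchSketch) so that it runs on its own.

```scala
object MatchSketch {

  // Simplified stand-ins for the classes used in this tutorial.
  trait Worker { def workGreeting: String }

  class Person (val name: String)

  class Linguist (name: String) extends Person(name) with Worker {
    def workGreeting = "I study language."
  }

  class Student extends Worker {
    def workGreeting = "I'm studying!"
  }

  // Cases are tried in order, so the most specific type is listed first.
  // The match is an expression: describe returns a String instead of printing.
  def describe (x: Any): String = x match {
    case pw: Person with Worker => pw.name + ": " + pw.workGreeting
    case p: Person => p.name
    case w: Worker => "Anonymous: " + w.workGreeting
    case _ => "Unknown"
  }

}
```

If the Person case were listed first, a Linguist would match it and the Person with Worker case would never be reached. The final case with an underscore catches anything that is neither a Person nor a Worker.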

The apply function

Scala provides a simple but incredibly nice feature: if you define an apply function in a class or object, you don’t actually need to write “apply” in order to use it. As an example, the following object adds one to an argument supplied to its apply method.

object AddOne {
  def apply (x: Int): Int = x+1
}

So, we can use it just like you’d normally expect.

scala> AddOne.apply(3)
res41: Int = 4

But, we can also do without the “.apply” portion and get the same result.

scala> AddOne(3)
res42: Int = 4

If a class has an apply method, then we can do the same trick with any object of that class.

class AddN (amountToAdd: Int) {
  def apply (x: Int): Int = x + amountToAdd
}

scala> val add2 = new AddN(2)
add2: AddN = AddN@43ca04a1

scala> add2(5)
res43: Int = 7

scala> val add42 = new AddN(42)
add42: AddN = AddN@83e591f

scala> add42(8)
res44: Int = 50

As it turns out, you’ve been using apply methods quite often, without knowing it! When you have a List and you access an element by index, you’ve used the apply method of the List class.

scala> val numbers = 10 to 20 toList
numbers: List[Int] = List(10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)

scala> numbers(3)
res46: Int = 13

scala> numbers.apply(3)
res47: Int = 13

Same thing for accessing values using keys in a Map, and similarly for many other of the classes you’ve been using in Scala so far.
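For instance, here is a short sketch with a Map and a String. The object name ApplySketch and the values in it are made up for this example.

```scala
object ApplySketch {

  val ages = Map("John" -> 37, "Jane" -> 34)

  // Looking up a key is really a call to the Map's apply method.
  val viaShorthand: Int = ages("Jane")
  val viaApply: Int = ages.apply("Jane")

  // Indexing into a String works the same way (indices start at 0).
  val secondLetter: Char = "Scala"(1)

}
```

Both viaShorthand and viaApply are 34, and secondLetter is the character c.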

Wrap-up

This tutorial has covered the basics of object-oriented programming in Scala. Hopefully, it is enough to give a decent sense of what objects and classes are and how you can do things with them. There is much much more to be learned about them, but this should be sufficient to get you started so that further study can be done meaningfully. It is important to understand these concepts since Scala is object-oriented from the ground up. In fact, in many of the previous tutorials, I’ve at times gone through some extra hoops to try to describe what is going on without having to talk about object-orientation. But now you can see things like Int, Double, List, Map, and so on for what they are: classes that contain particular fields and functions that you can use to get things done. You can now start coding your own classes to enable your own custom behaviors in your applications.


In loving memory of Belle Scarlett Baldridge
September 29, 2011

I buried my baby daughter Belle today. It wasn’t supposed to be this way. Babies just aren’t supposed to die. We are fortunate to live in a time of favorable survival rates for babies and their mothers. We enjoy high degrees of order and predictability in our day-to-day lives (here in the USA, at least), and it is easy to forget that one still has innocence to lose. This has been the saddest, hardest week of my life. I had always heard that a parent should never have to bury their own child. I didn’t doubt it, but now I know it, fully. This morning, I gazed down at a gaping hole, my little girl’s grave, while I held her casket in my arms. It mirrored the hole already in my heart. It disarmed and terrified me, but also showed me that both were there to receive Belle and preserve her memory.

With this post, I seek to honor and remember Belle, to thank those who have supported us this week, to help myself grieve, and hopefully, to help—perhaps a little—others in the future who must unfortunately deal with the death of their child. My apologies if the post is on the (melo)dramatic side. It’s how I feel, and it seems to be part of my healing process, so please bear with me.

My wife Cheryl and I had long been anticipating Belle’s arrival, with a due date of today – October 4, 2011. Like most expecting parents, we had considered many of the possible outcomes of the pregnancy, including even the possibility of complications that would involve our baby and/or Cheryl needing hospitalization — but never the possibility that our baby Belle wouldn’t make it into this world, never the possibility of a stillbirth. The unyielding march of life and death has left us suddenly and unexpectedly bereft of a person we loved, cared for and were ready to teach and eventually send forth into the world.

We knew Belle from her kicks, and her responses to our voices, songs, and laughter. It’s an imperfect medium of communication, but it suffices to start the relationship that one builds with one’s child — they simply aren’t strangers when you see them for the first time. This is something that can perhaps be hard to understand for those who have not yet had children, and it is a common source of pain for parents of stillborn children: it is somehow perceived by many to not be as great a loss as for those whose children died after their birth date. A great line I read in one of the many materials I’ve been given about such loss is that on a scale of one to ten, the pain of losing a child is always a ten, no matter the age or circumstances. It’s true. I would submit that there is a further dynamic element for parents of a stillborn child: you have gone from a state of accelerating excitement and anticipation, to a huge resounding thud of shock and disbelief. The “what if’s” have in very short order become “never be’s.” This sudden reversal kicks in the first moment you are told that your baby’s heartbeat has stopped and then reverberates as you reel from the pain and try to regroup.

Little Belle is true to her name: she is beautiful, even in death. I can now only imagine what she would have looked like as she grew up, but thankfully I can do at least that. And, I can do that from a starting point of having been able to spend time with her on the day she was born, September 29, 2011. We had a wonderful team with us at Belle’s birth—including doctors, midwives, nurses, and doula—and they helped us through the intensely emotional and difficult process of bringing Belle into the world and, perhaps more importantly, to help us spend meaningful time with her before saying goodbye. They encouraged us to be with Belle, to hold her and take pictures, and not rush things. We now have at least those memories—even so bittersweet—to keep with us, something which many parents of stillborn babies are never given because no one tells them they can and should. This is a really important aspect of Belle’s birth that I hope to get across: you are hurting and spinning from the shock and pain, yet there are important decisions to be made from the very start; while you may have been provided with comprehensive and well-written literature on how to approach the situation, you have little emotional space for it and there is too much of it to possibly work through before you must make decisions.  If you or someone you care for finds themselves unfortunately in this situation, try to get across this message: take time with the baby and take pictures. You won’t get more chances later, and you’ll almost surely regret it if you don’t.

Another important thing for us was to have a small memorial service for Belle, and also a burial. As an agnostic without any religious affiliation, I had no default expectation for what to do. Cheryl and I had years ago decided that cremation would be the thing for us eventually. However, with Belle, Cheryl quickly realized that she wanted a place to visit her, so we went with a burial. I did not feel strongly about it, but it felt right to me when we did it today, so I’ll probably be glad for that choice in the long run. It was very hard to pick out her plot at the cemetery on Friday—it’s an area reserved for infants, a grid of small plots that serves as a concrete reminder of the fragility of the early days of life. Looking at the empty spot where Belle would be buried made it all seem more real, more this-is-really-happening, in the mix of surreal feelings of that day and the previous day. Of course, handing over a credit card to pay for the services and the plot then felt bizarre, an odd juxtaposition of a completely mundane action with the profound grief I was keeping in check. Regardless of that strangeness, it is one of those things which just must be done. Belle is now there, and it is a peaceful place, with trees and birds singing in them.

It turns out that stillbirths are more common than I would have ever thought. I had only directly known of one before Belle, and had assumed it must have been a case of extreme misfortune. Actually, in the USA, the average rate of stillbirths is roughly 1 in 150 births, about 26,000 babies every year. The rate is much higher in developing countries. Despite this prevalence, there apparently is not a great deal of research into it (and it seems to be an inherently difficult thing to research), so we still know little about specific actions that can be taken to prevent it. For the things we do know, such as tangled umbilical cords, there is very little warning — there is a window of perhaps 5-10 minutes from the time of fetal distress in which to save the baby. Knowing this actually relieved us of a great deal of guilt as we had initially second guessed ourselves, retracing our steps in the days leading up to Belle’s birth and imagining ways we could/should have known to try to get her out earlier.

Regardless of the statistics, regardless of whether we’ll know the cause of Belle’s death, it all just ends up feeling unfair. I’ve been robbed of my little girl, whose heart I had heard beating just days before. Belle should have had her fair shot at life, and I’m sure she would have made hers a great one. It shouldn’t have been this way, but that is what happened and now we must live with that and move on. In this, I’m so thankful for the amazing relationship I have with Cheryl. We’re both hurting, immensely, but we also are optimists who have both already overcome our fair share of challenges in our lives. Together, and with the help of family and friends, we’ll regroup and carry on, carrying Belle’s memory with us.

Little Belle, I’ll love you forever.

Addendum

There are many people who have provided us with amazing, and often unexpected, support over the last week.

Our doula, Shelley Scotka, was our shining light on the day of Belle’s birth. Many people have probably never heard of doulas — summarizing quickly, they are amazing women who assist in natural childbirth. They bring their knowledge of traditional birthing techniques and practical experience from many births to bear on yours, including translating what the doctors are saying and doing so that you hear what is going on, in simple, understandable terms. Shelley was there for our son’s delivery, a 50+ hour marathon that she did a great deal to ease. Little did we know that she would be every bit as vital for us for a stillbirth as she was for a live birth. She was a rock who helped before, during and after the delivery, and who continues to shower us with love and care.

We’re also incredibly thankful for the medical team that delivered Belle last Thursday at St. David’s North Austin. Our practice is OB-GYN North, and the midwives, doctor, and technician who had to tell us that Belle’s heartbeat had stopped were caring and kind, and helped us immensely with the initial shock and disbelief. Kathy Harrison-Short, CNM  had caught our son two years before and she immediately came to comfort us. Lisa Carlile, CNM stayed past her shift and was the one who ultimately caught Belle, at Cheryl’s request. Dr. Martha Smitz was the physician on duty that day. She demonstrated tremendous sensitivity, compassion and overwhelming competence throughout. She had an uncanny ability to put us at ease even in the midst of the sorrow and confusion we were going through. The nurses, other doctors, social worker and pastor were all similarly supportive and sensitive. The nurses deserve special thanks for taking such great care of Cheryl before the delivery and of Belle after it. Everyone treated us, and Belle, with tremendous dignity.

Since that day, our family, friends and colleagues have been incredibly supportive. One of the blessings in tragedy is the concrete realization that one is surrounded by a wonderful support network. My younger brother lives here in Austin and my mother had just arrived, ready to help us with Belle; they’ve been helping us through the whole thing, especially with our toddler son, even while dealing with their own loss and grief. My father flew in from Chicago, and my older brother immediately came over from Baton Rouge with his daughter. The sound of her playing with our toddler son over the weekend was a welcome, joyful addition that helped combat the otherwise tendency toward a somber mood. My brother’s wife helped us a great deal from afar, providing support both as a family member and as a practicing physician. My step-father will be here soon, a delayed visit (at my request) since I knew we’d need more backup once the main family contingent was gone.

Others have also given us great strength, including sharing their own pain and anger at the situation, and in a few cases, their own direct experience with stillbirths. There have been generous offers of help, including offers to teach some of my classes in the coming weeks. Though I’ve so far responded to almost none of them, I’ve read and appreciated every email of support from friends, colleagues, and students. In a way, this post is my response, so please consider this my thank you to you all. And to those who I have not yet gotten in touch with about Belle’s death, please understand that there has not been any particular plan or care with my communications regarding it; I’m just now getting geared up to pass the word on to more friends, and some of you are probably seeing this post as a result of that effort.

I must also give high praise to the people at Cook-Walden funeral homes. They have treated us very kindly and have been incredibly responsive to our needs. One of the things about the situation is that many decisions must be made in rapid succession, and you get some of them not-quite-right the first time around. Cook-Walden was very accommodating to changes in how we wanted to do the service and burial and to requests for articles of Belle’s that we only realized later that we’d want (such as a lock of her hair). They treated us and Belle with dignity and allowed us time and space to make decisions and say goodbye to her.

Finally, I must thank the volunteers from Now I Lay Me Down To Sleep, who Shelley called in for us. NILMDTS is a non-profit that has professional photographers who come to take pictures of stillborn babies and their families, and then later retouch them to provide nicer images of the baby than one could generally hope to capture by oneself. They were caring and professional, and we look forward to seeing the result of their work with Belle. If you are looking for a great non-profit to donate to, please consider them.

Topics: scala.io.Source, accessing files, flatMap, mutable Maps

Preface

This is part 8 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

This tutorial is about accessing the file system in order to work with text files. The previous tutorial showed how to build a Map that contains the counts of each word type in a given text. However, it was assumed that the text was available in a String variable, and typically we are interested in knowing things about files that live on the file system, or on the internet. This tutorial shows how to read a file’s contents into Scala for processing, either by building a single String for the file or by consuming it line-by-line in a streaming fashion. Along the way, mutable Maps are introduced as a way to enable word counting without reading an entire file into memory.

Word count on the contents of a file

As an example, we’ll use the complete Sherlock Holmes from Project Gutenberg. Download it, put it into a directory, and then start up the Scala REPL in that directory. To access files, we’ll use the Source class, so to start you need to import it.

scala> import scala.io.Source
import scala.io.Source

Source provides a number of ways to interact with files and make them accessible to you in your Scala program. The fromFile method is the one you’ll probably need most.

scala> Source.fromFile("pg1661.txt")
res3: scala.io.BufferedSource = non-empty iterator

This creates a BufferedSource, from which you can easily get all of the file’s contents as a String.

scala> val holmes = Source.fromFile("pg1661.txt").mkString
holmes: String =
"Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net
<...many more lines...>

With this, you can do the same things as shown in tutorial 7 to get the word counts (except that here we’ll split on white space sequences rather than just a single space).

scala> val counts = holmes.split("\\s+").groupBy(x=>x).mapValues(x=>x.length)
counts: scala.collection.immutable.Map[java.lang.String,Int] = Map(wood-work, -> 1, "Pray, -> 1, herself. -> 2, stern-post -> 1, "Should -> 1, incident -> 8, serious -> 14, earth--" -> 2, sinister -> 10, comply -> 7, breaks -> 1, forgotten -> 3, precious -> 10, 'It -> 3, compliment -> 2, suite, -> 1, "DEAR -> 1, summarise. -> 1, "Done -> 1, fine.' -> 1, lover -> 5, of. -> 2, lead. -> 1, plentiful -> 1, 'Lone -> 4, malignant -> 1, terrible -> 14, rate -> 1, mole -> 1, assert -> 1, lights -> 2, Stevenson, -> 1, submitted -> 4, tap. -> 1, beard, -> 1, band--a -> 1, force! -> 1, snow -> 7, Produced -> 2, ask, -> 1, purchasing -> 1, Hall, -> 1, wall. -> 5, remarked -> 32, laughing -> 4, member." -> 1, 30,000 -> 2, Redistributing -> 1, coat, -> 6, "'One -> 2, 'band,' -> 1, relapsed -> 1, apol...

scala> counts("Holmes")
res2: Int = 197

scala> counts("Watson")
res3: Int = 4

Lest you think it strange that Watson only shows up four times, keep in mind that we split on whitespace, and that means that in a sentence like the following, the token of interest is Watson,” rather than Watson.

“You could not possibly have come at a better time, my dear Watson,” he said cordially.

Looking that and others up shows more tokens containing Watson in the story.

scala> counts("Watson,\"")
res4: Int = 19

scala> counts("Watson,")
res5: Int = 40

scala> counts("Watson.")
res6: Int = 10

Of course, the real problem is that tokenizing on whitespace is too crude. To do this properly generally takes a good hand-built tokenizer (which is able to keep tokens like e.g. and Mr. and Yahoo! while splitting punctuation off most words) or a machine learned one that is trained on data hand-labeled for tokens. For an example of the latter, see the Apache OpenNLP toolkit tokenizers, which include pre-trained models for English.
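To get a feel for how much the whitespace split is hurting the Watson counts, here is a minimal sketch of a slightly less crude tokenizer. The CrudeTokenizer object and its methods are made up for this illustration (they are not part of Scala or of any toolkit), and the rule it uses is just an assumption: any run of characters that are not letters, digits, or apostrophes is treated as a separator. It is still nowhere near a proper tokenizer, since it would break up tokens like e.g. and Mr. and leave stray apostrophes on tokens like 'It.

```scala
object CrudeTokenizer {
  // Hypothetical helper, not from the tutorial: treat any run of characters
  // that are not letters, digits, or apostrophes as a separator.
  def tokenize(text: String): List[String] =
    text.split("""[^\p{L}\p{N}']+""").filter(_.nonEmpty).toList

  // Count the tokens with the same groupBy idea used above.
  def count(text: String): Map[String, Int] =
    tokenize(text).groupBy(identity).map { case (word, occurrences) => (word, occurrences.length) }
}
```

With something like this, calling CrudeTokenizer.count on the holmes String should fold Watson," and Watson, and Watson. into a single Watson entry.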

Working line by line

Quite often, you need to work through a file line by line, rather than reading the entire thing in as a single string as we did above. For example, you might need to process each line differently, so just having it as a single String isn’t particularly convenient. Or, you might be working with a large file that cannot easily fit into memory (which is what happens when you read in the entire string). You can obtain the lines in the file as an Iterator[String], in which each item is a single line from the file, using the getLines method.

scala> Source.fromFile("pg1661.txt").getLines
res4: Iterator[String] = non-empty iterator

This iterator is ready for you to consume lines, but it doesn’t read all of the file into memory right away — instead it buffers it such that each line will be available for you as you ask for it, essentially reading off disk as you demand more lines. You can think of this as streaming the file to your Scala program, much like modern audio and video content is streamed to your computer: it is never actually stored, but is just transferred in parts to where it is needed, when it is needed.
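The on-demand behavior is easy to see with a small in-memory Iterator standing in for the file. This is only a sketch: the LazyLines object and the processed counter are made up for the illustration, and the four strings play the role of lines that would otherwise come from getLines.

```scala
object LazyLines {
  // Returns how many lines had been processed before and after consuming two items.
  def demo(): (Int, Int) = {
    var processed = 0
    // Stand-in for Source.fromFile(...).getLines
    val lines = Iterator("one", "two", "three", "four")
    // Setting up the map does no work yet.
    val shouted = lines.map { line => processed += 1; line.toUpperCase }
    val before = processed
    // Only the two lines we actually ask for are processed.
    shouted.take(2).foreach(_ => ())
    (before, processed)
  }
}
```

Calling LazyLines.demo() gives (0, 2): nothing happens when the map is set up, and only two of the four lines are touched when two are requested.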

Of course, Iterators share much with sequence data structures like Lists: once we have an Iterator, we can use foreach, for, map, etc. on it. So to print out all of the lines in the file, we can do the following.

scala> Source.fromFile("pg1661.txt").getLines.foreach(println)
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net

Title: The Adventures of Sherlock Holmes

Author: Arthur Conan Doyle
<...many more lines...>

That creates a lot of output, but it shows you how you can easily create your own Scala implementation of the Unix cat program: just save the following line in a file called cat.scala:

scala.io.Source.fromFile(args(0)).getLines.foreach(println)

And then call that with the name of the file to list its contents.

$ scala cat.scala pg1661.txt

Back in the REPL, it is somewhat less-than-ideal to see the entire file. If you just want to see the start of the file, use the take method on the Iterator before the foreach.

scala> Source.fromFile("pg1661.txt").getLines.take(5).foreach(println)
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included

The take method is quite useful in general with any sequence, and provides the complement of the drop method, as shown in the following examples on a simple List[Int].

scala> val numbers = (1 to 10).toList
numbers: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> numbers.take(3)
res12: List[Int] = List(1, 2, 3)

scala> numbers.drop(3)
res13: List[Int] = List(4, 5, 6, 7, 8, 9, 10)

scala> numbers.take(3) ::: numbers.drop(3)
res14: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

Word counting line by line, first try

Now that we’ve seen how to read a file and start working with it line-by-line, how do we count the number of occurrences of each word? Recall from tutorial 7 and above that the starting point was to have a sequence (Array, List, etc) of Strings in which each element is a word token. To start moving toward that, we can simply use the toList method on the Iterator[String] obtained from getLines.

scala> val holmes = Source.fromFile("pg1661.txt").getLines.toList
holmes: List[String] = List(The Project Gutenberg EBook of The Adventures of Sherlock Holmes, by Sir Arthur Conan Doyle, (#15 in our series by Sir Arthur Conan Doyle), "", Copyright laws are changing all over the world. Be sure to check the, copyright laws for your country before downloading or redistributing, this or any other Project Gutenberg eBook., "", This header should be the first thing seen when viewing this Project, Gutenberg file.  Please do not remove it.  Do not change or edit the, header without written permission., "", Please read the "legal small print," and other information about the, eBook and Project Gutenberg at the bottom of this file.  Included is, important information about your specific rights and restrictions in, how the file may be used.  You can also find ou...

We now have the contents of the file as a List[String], and may proceed to do useful things with it. For example, we could map each line (Strings) to be sequences of whitespace-separated Strings.

scala> val listOfListOfWords = Source.fromFile("pg1661.txt").getLines.toList.map(x => x.split(" ").toList)
listOfListOfWords: List[List[java.lang.String]] = List(List(Project, Gutenberg's, The, Adventures, of, Sherlock, Holmes,, by, Arthur, Conan, Doyle), List(""), List(This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with), List(almost, no, restrictions, whatsoever., "", You, may, copy, it,, give, it, away, or), List(re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included), List(with, this, eBook, or, online, at, www.gutenberg.net), List(""), List(""), List(Title:, The, Adventures, of, Sherlock, Holmes), List(""), List(Author:, Arthur, Conan, Doyle), List(""), List(Posting, Date:, April, 18,, 2011, [EBook, #1661]), List(First, Posted:, November, 29,, 2002), List(""), List(Language:, English), List(""), List(""), List(***, START, OF, THIS, PRO...

And, as we saw in tutorial 7, when we have a List of Lists, we can use flatten to create one big List.

scala> val listOfWords = listOfListOfWords.flatten
listOfWords: List[java.lang.String] = List(Project, Gutenberg's, The, Adventures, of, Sherlock, Holmes,, by, Arthur, Conan, Doyle, "", This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with, almost, no, restrictions, whatsoever., "", You, may, copy, it,, give, it, away, or, re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included, with, this, eBook, or, online, at, www.gutenberg.net, "", "", Title:, The, Adventures, of, Sherlock, Holmes, "", Author:, Arthur, Conan, Doyle, "", Posting, Date:, April, 18,, 2011, [EBook, #1661], First, Posted:, November, 29,, 2002, "", Language:, English, "", "", ***, START, OF, THIS, PROJECT, GUTENBERG, EBOOK, THE, ADVENTURES, OF, SHERLOCK, HOLMES, ***, "", "", "", "", Produced, by, an, anonymous, Project, Gut...

But, now you might recognize that this is the map-then-flatten pattern we saw previously, which means we can flatMap it instead.

scala> val flatMappedWords = Source.fromFile("pg1661.txt").getLines.toList.flatMap(x => x.split(" "))
flatMappedWords: List[java.lang.String] = List(Project, Gutenberg's, The, Adventures, of, Sherlock, Holmes,, by, Arthur, Conan, Doyle, "", This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with, almost, no, restrictions, whatsoever., "", You, may, copy, it,, give, it, away, or, re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included, with, this, eBook, or, online, at, www.gutenberg.net, "", "", Title:, The, Adventures, of, Sherlock, Holmes, "", Author:, Arthur, Conan, Doyle, "", Posting, Date:, April, 18,, 2011, [EBook, #1661], First, Posted:, November, 29,, 2002, "", Language:, English, "", "", ***, START, OF, THIS, PROJECT, GUTENBERG, EBOOK, THE, ADVENTURES, OF, SHERLOCK, HOLMES, ***, "", "", "", "", Produced, by, an, anonymous, Project,...
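The equivalence of the two approaches can be checked on a few made-up lines. This is just a sketch with a hypothetical FlatMapDemo object and sample strings of my own; it also shows where the "" entries in the output above come from, since an empty line or a doubled space produces an empty String when splitting on a single space.

```scala
object FlatMapDemo {
  // Made-up lines: one has a double space, one is empty.
  val lines = List("the dog  barks", "", "the cat")

  // Map each line to its words, then flatten the List of Lists.
  val viaMapFlatten: List[String] = lines.map(x => x.split(" ").toList).flatten

  // The same thing in one step.
  val viaFlatMap: List[String] = lines.flatMap(x => x.split(" "))
}
```

Both values come out as List(the, dog, "", barks, "", the, cat).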

But you should be a bit bothered by all this: wasn’t the idea here (in part) not to read all of the lines in at once? Indeed, with what we did above, as soon as we said toList on the Iterator, the whole file was read into memory. However, we can do without the toList step and just directly flatMap the Iterator and get a new Iterator over the tokens rather than the lines.

scala> val flatMappedWords = Source.fromFile("pg1661.txt").getLines.flatMap(x => x.split(" "))
flatMappedWords: Iterator[java.lang.String] = non-empty iterator

Now, if we want to count the words, we can convert that to a List and do the groupBy and mapValues trick we’ve seen already (output omitted).

scala> val counts = Source.fromFile("pg1661.txt").getLines.flatMap(x => x.split(" ")).toList.groupBy(x=>x).mapValues(x=>x.length)

Oops: that worked, but we once again brought the whole file into memory, because the List created by toList holds every token in the file. We’ll see next how to use a mutable Map to get around this.

Word counting by streaming with an Iterator and using mutable Maps

In all of the tutorials so far, I’ve pretty much stuck to immutable data structures except when mutable ones show up due to context (like Arrays coming out of the split method). It’s good to try to make use of immutable data structures where possible, but there are times when mutable ones are more convenient and perhaps more appropriate.

With the immutable Maps we saw in the previous tutorial, you could not change the assignment to a key, nor could you add a new key.

scala> val lettersToNumbers = Map("A"->1, "B"->2, "C"->3)
lettersToNumbers: scala.collection.immutable.Map[java.lang.String,Int] = Map(A -> 1, B -> 2, C -> 3)

scala> lettersToNumbers("A") = 4
<console>:9: error: value update is not a member of scala.collection.immutable.Map[java.lang.String,Int]
lettersToNumbers("A") = 4

scala> lettersToNumbers("D") = 5
<console>:9: error: value update is not a member of scala.collection.immutable.Map[java.lang.String,Int]
lettersToNumbers("D") = 5

There is another kind of Map, scala.collection.mutable.Map, that does allow this sort of behavior.

scala> import scala.collection.mutable
import scala.collection.mutable

scala> val mutableLettersToNumbers = mutable.Map("A"->1, "B"->2, "C"->3)
mutableLettersToNumbers: scala.collection.mutable.Map[java.lang.String,Int] = Map(C -> 3, B -> 2, A -> 1)

scala> mutableLettersToNumbers("A") = 4

scala> mutableLettersToNumbers("D") = 5

scala> mutableLettersToNumbers
res4: scala.collection.mutable.Map[java.lang.String,Int] = Map(C -> 3, D -> 5, B -> 2, A -> 4)

It also has a handy way to increase the count associated with a key, using the += method.

scala> mutableLettersToNumbers("D") += 5

scala> mutableLettersToNumbers
res6: scala.collection.mutable.Map[java.lang.String,Int] = Map(C -> 3, D -> 10, B -> 2, A -> 4)

However, we can’t use that method with a key that doesn’t exist.

scala> mutableLettersToNumbers("E") += 1
java.util.NoSuchElementException: key not found: E
<...stacktrace...>

Fortunately, we can provide a default. Here’s an example of starting a new Map with a default of 0.

scala> val counts = mutable.Map[String,Int]().withDefault(x=>0)
counts: scala.collection.mutable.Map[String,Int] = Map()

scala> counts("Z") += 1

scala> counts("Y") += 1

scala> counts("Z") += 1

scala> counts
res11: scala.collection.mutable.Map[String,Int] = Map(Z -> 2, Y -> 1)

Note: when you start with some values already in a Map, Scala can infer the types of the keys and the values, but when initializing an empty Map, it is necessary to explicitly declare the key and value types.
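One detail worth knowing about the default is that just looking up a missing key does not store it in the Map; the key only appears once something is assigned to it. Here is a small sketch of that, along with getOrElse as an alternative way to supply a fallback for a single lookup. The DefaultDemo object is made up for the illustration.

```scala
import scala.collection.mutable

object DefaultDemo {
  def run(): (Int, Boolean, Int) = {
    val counts = mutable.Map[String, Int]().withDefault(x => 0)
    // Looking up a missing key gives the default...
    val lookedUp = counts("Q")
    // ...but does not add the key to the Map.
    val storedAfterLookup = counts.contains("Q")
    // getOrElse supplies a fallback for one lookup, with or without a default.
    counts("Q") = counts.getOrElse("Q", 0) + 1
    (lookedUp, storedAfterLookup, counts("Q"))
  }
}
```

DefaultDemo.run() gives (0, false, 1).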

With this in hand, here is how we can use flatMap plus a mutable Map to count words in a text without reading the entire text into memory.

import scala.collection.mutable
val counts = mutable.Map[String, Int]().withDefault(x=>0)
for (token <- scala.io.Source.fromFile("pg1661.txt").getLines.flatMap(x => x.split("\\s+")))
  counts(token) += 1

Having created the counts Map in this way, we can convert it to an immutable Map with the toMap method once we are done adding elements.

scala> val fixedCounts = counts.toMap
fixedCounts: scala.collection.immutable.Map[String,Int] = Map(wood-work, -> 1,
<...output truncated...>
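If you find yourself doing this more than once, the steps above can be packaged into a function. The following is a sketch under a couple of assumptions of my own: the StreamingCounts object and countTokens name are made up, it takes any Iterator[String] of lines (so it can be tried without a file), and it drops the empty tokens that splitting produces for blank lines or leading whitespace.

```scala
import scala.collection.mutable

object StreamingCounts {
  // Hypothetical helper: counts whitespace-separated tokens from an Iterator
  // of lines, keeping only the counts in memory rather than the lines.
  def countTokens(lines: Iterator[String]): Map[String, Int] = {
    val counts = mutable.Map[String, Int]().withDefault(x => 0)
    for (token <- lines.flatMap(x => x.split("\\s+")) if token.nonEmpty)
      counts(token) += 1
    // Hand back an immutable Map once counting is done.
    counts.toMap
  }
}
```

For the Holmes file, this would be called as StreamingCounts.countTokens(scala.io.Source.fromFile("pg1661.txt").getLines).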

Now we can’t modify the values on fixedCounts, which has advantages in many contexts, e.g. we can’t accidentally destroy values or add unwanted keys, and there are (positive) implications for parallel processing.

scala> fixedCounts("Holmes") = 0
<console>:13: error: value update is not a member of scala.collection.immutable.Map[String,Int]
fixedCounts("Holmes") = 0
^

Reading a file from a URL

As it turns out, scala.io.Source can do a lot more than read from a file. Another example is to read from a URL to access a file on the internet, using the fromURL method.

val holmesUrl = """http://www.gutenberg.org/cache/epub/1661/pg1661.txt"""
for (line <- Source.fromURL(holmesUrl).getLines)
  println(line)

If you are just going to analyze the same file again and again, this is probably not what you need — just download the file and use it locally. However, it can be quite useful in contexts where you are exploring links within pages (e.g. while processing Wikipedia or Twitter data) and need to read in content from URLs on the fly.

Use (up) the Source

A final note on the Iterators you get with Source.fromFile and Source.fromURL: you can only iterate through them once! This is part of what makes them more efficient: they aren’t holding all that text in memory. So, don’t be surprised if you get the following behavior.


scala> val holmesIterator = Source.fromFile("pg1661.txt").getLines
holmesIterator: Iterator[String] = non-empty iterator

scala> holmesIterator.foreach(println)

Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net

<...many lines of output...>

This Web site includes information about Project Gutenberg-tm,
including how to make donations to the Project Gutenberg Literary
Archive Foundation, how to help produce our new eBooks, and how to
subscribe to our email newsletter to hear about new eBooks.

scala> holmesIterator.foreach(println)

<...nothing output!...>

So, the Iterator is used up! If you want to go through the file again, you’ll need to spin up a new Iterator just like you did the first time around. The neat thing about staying with the Iterators and not converting to Lists (and thus bringing everything into memory) is that each mapping operation we do on the Iterator applies only for the current item we are looking at, so we never need to read the whole file into memory.
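The same use-it-once behavior shows up with any Iterator, so it can be tried out without a file. In this sketch, the OneShot object and the freshLines function are made up for the illustration; freshLines plays the role of calling Source.fromFile("pg1661.txt").getLines again.

```scala
object OneShot {
  def run(): (Int, Int, Int) = {
    // Stand-in for spinning up a new Iterator over the file's lines.
    def freshLines(): Iterator[String] = Iterator("a", "b", "c")

    val lines = freshLines()
    val firstPass = lines.size         // walks through (and uses up) the Iterator
    val secondPass = lines.size        // nothing left the second time
    val withNewIterator = freshLines().size
    (firstPass, secondPass, withNewIterator)
  }
}
```

OneShot.run() gives (3, 0, 3).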

Of course, if you have a reasonably small file to work with, you should feel absolutely free to toList it and work with it that way if you prefer; it will often be more convenient since you can do the groupBy and mapValues pattern.


Topics: Maps, Sets, groupBy, Options, flatten, flatMap

Preface

This is part 7 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

Lists (and other sequence data structures, like Ranges and Arrays) allow you to group collections of objects in an ordered manner: you can access elements of a list by indexing their position in the list, or iterate over the list elements, one by one, using for expressions and sequence functions like map, filter, reduce and fold. Another important kind of data structure is the associative array, which you’ll come to know in Scala as a Map. (Yes, this has the unfortunate ambiguity with the map function, but their use will be quite clear from context.) Maps allow you to store a collection of key-value pairs and to access the values by the keys associated with them, rather than via an index (as with a List).

Example cases where you could use a Map:

  • Associating English words with their German translations
  • Associating each word with its count in a given text
  • Associating each word with its possible parts-of-speech

You’ll see concrete examples of each of these in this post.

Creating Maps and accessing their elements

Maps are quite intuitive to grasp. Here’s an example with a few English words and their German translations. One easy way of creating a Map is by passing in a list of pairs, where the first element of each pair defines a key and the second defines a corresponding value.

scala> val engToDeu = Map(("dog","Hund"), ("cat","Katze"), ("rhinoceros","Nashorn"))
engToDeu: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(dog -> Hund, cat -> Katze, rhinoceros -> Nashorn)

Notice that the Map entries are of the form key -> value. We may then retrieve the German translation for dog by providing the key “dog” to the Map we created.

scala> engToDeu("dog")
res0: java.lang.String = Hund

Think for a moment what you would have to do to accomplish this with Lists. You’d need two Lists, one for each language, and they’d need to be aligned so that each element in one list corresponded to its translation in the other list.

scala> val engWords = List("dog","cat","rhinoceros")
engWords: List[java.lang.String] = List(dog, cat, rhinoceros)

scala> val deuWords = List("Hund","Katze","Nashorn")
deuWords: List[java.lang.String] = List(Hund, Katze, Nashorn)

Then, to find the translation of cat, you would have to find the index of cat in engWords, and then look up that index in deuWords.

scala> engWords.indexOf("cat")
res2: Int = 1

scala> deuWords(engWords.indexOf("cat"))
res3: java.lang.String = Katze

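This is quite inefficient, since indexOf has to scan engWords from the start until it finds a match, and it is fragile as well, since the two lists must be kept aligned by hand. Maps are the right thing for what we want here, and they do the job of retrieving values for keys quite efficiently.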

It turns out that we can take two lists that are aligned in this way and construct a Map very easily. Recall that zipping two lists together creates one list of pairs, where each pair gives the elements that shared the same index.

scala> engWords.zip(deuWords)
res4: List[(java.lang.String, java.lang.String)] = List((dog,Hund), (cat,Katze), (rhinoceros,Nashorn))

By calling the toMap method on such a List of pairs, we get a Map.

scala> engWords.zip(deuWords).toMap
res5: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(dog -> Hund, cat -> Katze, rhinoceros -> Nashorn)

Note that even though the REPL is showing the order of the key-value pairs to be the same as the original list we constructed the map from, there is no inherent order to the elements of a Map.
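To make the point about order concrete, here is a minimal sketch (the names fromPairs, reordered, sameMap and sameList are just for illustration). Two Maps built from the same pairs in a different order compare as equal, whereas two Lists with the same elements in a different order do not.

```scala
// The same key-value pairs, given in two different orders.
val fromPairs = Map(("dog", "Hund"), ("cat", "Katze"))
val reordered = Map(("cat", "Katze"), ("dog", "Hund"))

// Maps are compared by their key-value pairs, not by position.
val sameMap = fromPairs == reordered                       // true

// Lists are ordered, so position matters.
val sameList = List("dog", "cat") == List("cat", "dog")    // false
```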

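You can add elements to a Map to create a new Map using the + operator and an arrow -> between each key and its value. Put the key -> value pair in parentheses: without them, engToDeu + "owl" -> "Eule" is read as (engToDeu + "owl") -> "Eule", which produces a tuple rather than a Map.

scala> engToDeu + ("owl" -> "Eule")
res6: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(dog -> Hund, cat -> Katze, rhinoceros -> Nashorn, owl -> Eule)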

scala> engToDeu + ("owl" -> "Eule", "hippopotamus" -> "Nilpferd")
res7: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(rhinoceros -> Nashorn, dog -> Hund, owl -> Eule, hippopotamus -> Nilpferd, cat -> Katze)
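It is worth seeing that + builds a new Map and leaves the original one untouched. The following is a small sketch of that behavior; the name withOwl and the other val names are only illustrative.

```scala
val engToDeu = Map("dog" -> "Hund", "cat" -> "Katze", "rhinoceros" -> "Nashorn")

// The parentheses group "owl" -> "Eule" into a pair before + is applied.
val withOwl = engToDeu + ("owl" -> "Eule")

// The original Map is unchanged; only the new Map has the extra entry.
val originalSize = engToDeu.size              // 3
val newSize = withOwl.size                    // 4
val hasOwlBefore = engToDeu.contains("owl")   // false
val hasOwlAfter = withOwl.contains("owl")     // true
```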

You can add one Map to another using the ++ operator.


scala> val newEntries = Map(("hippopotamus", "Nilpferd"),("owl","Eule"))
newEntries: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(hippopotamus -> Nilpferd, owl -> Eule)

scala> val expandedEngToDeu = engToDeu ++ newEntries
expandedEngToDeu: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(rhinoceros -> Nashorn, dog -> Hund, owl -> Eule, hippopotamus -> Nilpferd, cat -> Katze)

You can do the same by passing in a List of tuples to the ++ operator.


scala> engToDeu ++ List(("hippopotamus", "Nilpferd"),("owl","Eule"))
res8: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(rhinoceros -> Nashorn, dog -> Hund, owl -> Eule, hippopotamus -> Nilpferd, cat -> Katze)

And you can remove a key from a Map with the - operator.


scala> engToDeu - "dog"
res9: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(cat -> Katze, rhinoceros -> Nashorn)

See the Map API for more examples of such functions. Note: throughout this post, I’m sticking to immutable Maps — if you are looking at any other tutorials and are wondering why certain methods from those aren’t working here, they may have been using mutable Maps, which we’ll discuss later.

If we ask for the value associated with a key that doesn’t exist in the Map, we get an error.

scala> engToDeu("bird")
java.util.NoSuchElementException: key not found: bird
at scala.collection.MapLike$class.default(MapLike.scala:224)
(etc.)

You can check for whether a key is in the Map using the contains method.

scala> engToDeu.contains("bird")
res10: Boolean = false

scala> engToDeu.contains("dog")
res11: Boolean = true

Let’s say you have a list of English words and want to look up their corresponding German words, while protecting yourself against the NoSuchElementException. One way to do this is to filter the words using contains, and then map the remaining ones through engToDeu.

scala> val wordsToTranslate = List("dog","bird","cat","armadillo")
wordsToTranslate: List[java.lang.String] = List(dog, bird, cat, armadillo)

scala> wordsToTranslate.filter(x=>engToDeu.contains(x)).map(x=>engToDeu(x))
res12: List[java.lang.String] = List(Hund, Katze)

This is a useful way of safely applying a Map to a list of items. However, we’ll see a better way to deal with missing values later on, using Options.

If there is a sensible default value for any key you might try with your map, you can use the getOrElse method. You provide the key as the first argument, and then the default value as the second.


scala> engToDeu.getOrElse("dog","???")
res1: java.lang.String = Hund

scala> engToDeu.getOrElse("armadillo","???")
res2: java.lang.String = ???

It is quite common to use getOrElse with a default of 0 for Maps that contain statistics, such as word counts (see below), where the absence of a key naturally indicates that it has, e.g., a count of zero.

If you have a consistent default value for any keys that aren’t in the Map, you can set it by using the withDefault method.


scala> val engToDeu = Map(("dog","Hund"), ("cat","Katze"), ("rhinoceros","Nashorn")).withDefault(x => "???")
engToDeu: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(dog -> Hund, cat -> Katze, rhinoceros -> Nashorn)

scala> engToDeu("armadillo")
res3: java.lang.String = ???

Now you can ask for values in the usual manner, without needing to use getOrElse and providing the default every time.
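One subtlety is that a default only affects direct lookup; it does not add any keys to the Map. The sketch below uses withDefaultValue, a standard library variant of withDefault that takes a constant instead of a function. The val names are just for illustration.

```scala
val engToDeu = Map("dog" -> "Hund", "cat" -> "Katze").withDefaultValue("???")

val known = engToDeu("dog")            // "Hund"
val unknown = engToDeu("armadillo")    // "???", supplied by the default

// The default does not create an entry: contains and get still see the key as missing.
val hasArmadillo = engToDeu.contains("armadillo")   // false
val maybeArmadillo = engToDeu.get("armadillo")      // None
```

So if you need to know whether a key was really present, contains (or get, discussed later in this post) remains the thing to ask.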

Keys and values in Maps

You may have observed that Scala tells you more than just that you have created a Map. Like List, Map is a parameterized type, which means that it is a generic way of collecting a bunch of objects of particular types together. Above we saw an instance of a Map[String, String] (leaving off the java.lang part to make it clearer). The first String indicates that the keys are Strings and the second that the values are Strings. Basically, any type can be used in either position (warning: you should avoid using mutable data structures as keys unless you know what you are doing). Here are some examples (try to ignore the scala.collection.immutable and java.lang parts and just focus on the Map[X,Y] signatures we get).

scala> Map((10,"ten"), (100,"one hundred"))
res0: scala.collection.immutable.Map[Int,java.lang.String] = Map(10 -> ten, 100 -> one hundred)

scala> Map(("a",1),("b",2))
res1: scala.collection.immutable.Map[java.lang.String,Int] = Map(a -> 1, b -> 2)

scala> Map((1,3.14), (2,6.28))
res2: scala.collection.immutable.Map[Int,Double] = Map(1 -> 3.14, 2 -> 6.28)

scala> Map((("pi",1),3.14), (("tau",2),6.28))
res3: scala.collection.immutable.Map[(java.lang.String, Int),Double] = Map((pi,1) -> 3.14, (tau,2) -> 6.28)

scala> Map(("the",List("Determiner")),("book",List("Verb","Noun")),("off",List("Preposition","Verb")))
res4: scala.collection.immutable.Map[java.lang.String,List[java.lang.String]] = Map(the -> List(Determiner), book -> List(Verb, Noun), off -> List(Preposition, Verb))

The last two examples show some very useful aspects of key and value types that allow you to use more complex keys and values. The former uses a (String, Int) pair as a key, with signature Map[(String, Int), Double], and the latter uses a List[String] as the value, with signature Map[String, List[String]]. So you can bundle together several types using tuples and you can use parameterized data structures to parameterize another data structure.
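As a small sketch of how such Maps are used once created, the following writes the types out explicitly and performs a lookup on each. The names constants and tagDict are made up for this example.

```scala
// A Map keyed by (String, Int) pairs: the key passed to the lookup is itself a tuple.
val constants: Map[(String, Int), Double] = Map(("pi", 1) -> 3.14, ("tau", 2) -> 6.28)
val pi = constants(("pi", 1))                 // 3.14

// A Map whose values are Lists: look up the List, then use it like any other List.
val tagDict: Map[String, List[String]] =
  Map("the" -> List("Determiner"), "book" -> List("Verb", "Noun"))
val bookTagCount = tagDict("book").length     // 2
```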

A simple translation task

Here is a mini German/English dictionary as a Map.

scala> val miniDictionary = Map(("befreit","liberated"),("baeche","brooks"),("eise","ice"),("sind","are"),("strom","river"),("und","and"),("vom","from"))
miniDictionary: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(und -> and, eise -> ice, sind -> are, befreit -> liberated, strom -> river, vom -> from, baeche -> brooks)

We can provide a (very bad) translation of the German sentence “vom eise befreit sind strom und baeche” using this dictionary: we simply split the German sentence and then map over its elements, looking up each word in the dictionary.

scala> val example = "vom eise befreit sind strom und baeche"
example: java.lang.String = vom eise befreit sind strom und baeche

scala> example.split(" ").map(deuWord => miniDictionary(deuWord)).mkString(" ")
res0: String = from ice liberated are river and brooks

Okay, not quite “from the ice they are freed, the stream and brook” but then again it’s pretty much the dumbest machine translation approach available…

A danger of course is that we will have words that aren’t in the dictionary, leading to an exception.

scala> val example2 = "vom eise befreit sind strom und schiffe"
example2: java.lang.String = vom eise befreit sind strom und schiffe

scala> example2.split(" ").map(deuWord => miniDictionary(deuWord)).mkString(" ")
java.util.NoSuchElementException: key not found: schiffe

We’ll return to this below.

Creating Maps from Lists using groupBy

We frequently have data stored in a particular data structure and would like to work with it using another data structure that organizes the data points in some other manner. Here, we’ll look at how to convert a List into a Map using the groupBy method in order to do some useful processing for working with parts-of-speech. We’ll also see the Set data structure along the way.

We’ll start with a very basic example of what groupBy does. Given a list of number tokens, we can obtain a Map from the number types to all of the tokens of each number.

scala> val numbers = List(1,4,5,1,6,5,2,8,1,9,2,1)
numbers: List[Int] = List(1, 4, 5, 1, 6, 5, 2, 8, 1, 9, 2, 1)

scala> numbers.groupBy(x=>x)
res19: scala.collection.immutable.Map[Int,List[Int]] = Map(5 -> List(5, 5), 1 -> List(1, 1, 1, 1), 6 -> List(6), 9 -> List(9), 2 -> List(2, 2), 8 -> List(8), 4 -> List(4))

As you can see from the result, groupBy took the anonymous function x=>x, grouped all of the elements of the List that have the same value of x, and then created a Map from each x to the group containing its tokens. So, we get 2 mapping to a List containing 2s, and so on. This probably seems a bit weird, but it is incredibly useful when we consider Lists that have more interesting elements in them. To do so, let’s go back to the part-of-speech tagging example from Part 4 of these tutorials. Say we have a sentence that is tagged with parts of speech, such as the following (made up) example that ensures some tag ambiguities.

in the dark , a tall man saw the saw that he needed to man to cut the dark tree .

The parts-of-speech could be annotated as follows (with lots of simplifications, and apologies for any offense caused to anyone’s linguistic sensitivities).

in/Prep the/Det dark/Noun ,/Punc a/Det tall/Adjective man/Noun saw/Verb the/Det saw/Noun that/Pronoun he/Pronoun needed/Verb to/Prep man/Verb to/Prep cut/Verb the/Det dark/Adjective tree/Noun ./Punc

See Part 4 for detailed explanation of how the following expression turns a string like this into a List of tuples.

scala> val tagged = "in/Prep the/Det dark/Noun ,/Punc a/Det tall/Adjective man/Noun saw/Verb the/Det saw/Noun that/Pronoun he/Pronoun needed/Verb to/Prep man/Verb to/Prep cut/Verb the/Det dark/Adjective tree/Noun ./Punc".split(" ").toList.map(x => x.split("/")).map(x => (x(0), x(1)))
tagged: List[(java.lang.String, java.lang.String)] = List((in,Prep), (the,Det), (dark,Noun), (,,Punc), (a,Det), (tall,Adjective), (man,Noun), (saw,Verb), (the,Det), (saw,Noun), (that,Pronoun), (he,Pronoun), (needed,Verb), (to,Prep), (man,Verb), (to,Prep), (cut,Verb), (the,Det), (dark,Adjective), (tree,Noun), (.,Punc))

Now, let’s use groupBy in various ways on this. The first thing we might be interested in is seeing which parts of speech each word is associated with.

scala> val groupedTagged = tagged.groupBy(x => x._1)
groupedTagged: scala.collection.immutable.Map[java.lang.String,List[(java.lang.String, java.lang.String)]] = Map(in -> List((in,Prep)), needed -> List((needed,Verb)), . -> List((.,Punc)), cut -> List((cut,Verb)), saw -> List((saw,Verb), (saw,Noun)), a -> List((a,Det)), man -> List((man,Noun), (man,Verb)), that -> List((that,Pronoun)), dark -> List((dark,Noun), (dark,Adjective)), to -> List((to,Prep), (to,Prep)), , -> List((,,Punc)), tall -> List((tall,Adjective)), he -> List((he,Pronoun)), tree -> List((tree,Noun)), the -> List((the,Det), (the,Det), (the,Det)))

So, now you see that the keys in the Map constructed by groupBy are the words and the values are the groups of the original elements. You can then see that the anonymous function x => x._1 provided to groupBy does two things: it specifies the part of the input elements that will group different items together and it specifies that that part of the input defines the key space.
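The function given to groupBy does not have to pick out a piece of a tuple; it can compute any property of the element. Here is a minimal sketch that groups words by their length (the word list and val names are just for illustration).

```scala
val words = List("dog", "cat", "rhinoceros", "owl", "hippopotamus")

// The result of the function (here, the word length) becomes the key.
val byLength = words.groupBy(w => w.length)

// Each value holds the elements that produced that key, in their original order.
val threeLetterWords = byLength(3)    // List("dog", "cat", "owl")
val tenLetterWords = byLength(10)     // List("rhinoceros")
```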

However, we don’t quite have what we want, which is to have the set of parts of speech associated with each word. Instead we have a List of tuples, e.g.:

scala> groupedTagged("saw")
res21: List[(java.lang.String, java.lang.String)] = List((saw,Verb), (saw,Noun))

Focussing on just this for a moment, we can map this and produce a List with just the parts-of-speech, and then turn that List into a Set with the toSet method in order to get just the unique parts-of-speech.

scala> groupedTagged("saw").map(x=>x._2)
res24: List[java.lang.String] = List(Verb, Noun)

scala> groupedTagged("saw").map(x=>x._2).toSet
res25: scala.collection.immutable.Set[java.lang.String] = Set(Verb, Noun)

Converting the List to a Set didn’t do much here, but consider “the”, which has multiple tokens with the same part-of-speech.

scala> groupedTagged("the")
res26: List[(java.lang.String, java.lang.String)] = List((the,Det), (the,Det), (the,Det))

scala> groupedTagged("the").map(x=>x._2)
res27: List[java.lang.String] = List(Det, Det, Det)

scala> groupedTagged("the").map(x=>x._2).toSet
res28: scala.collection.immutable.Set[java.lang.String] = Set(Det)

Sets are yet another of the useful data structures you have to work with, along with Maps and Lists. They work just like you would expect Sets to: they contain a collection of unique, unordered elements, and they allow you to see whether an element is in the set, whether one set is a subset of another, iterate over their elements, etc.
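Here is a brief sketch of the Set operations just mentioned, using standard methods from the Scala collections library. The tag sets themselves are invented for the example.

```scala
val nounTags = Set("Noun", "Pronoun")
val manTags = Set("Noun", "Verb")

val isMember = manTags.contains("Verb")         // true
val isSubset = Set("Noun").subsetOf(manTags)    // true
val shared = manTags.intersect(nounTags)        // Set("Noun")
val combined = manTags.union(nounTags)          // contains Noun, Verb and Pronoun

// Duplicates collapse, and element order does not affect equality.
val sameSet = Set("Verb", "Noun", "Noun") == manTags   // true
```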

Now, back to getting from the word/tag pairs to a mapping from words to possible tags for each word. The keys we got from tagged.groupBy(x => x._1)  are what we want, but we want to transform the values from Lists of word/tag tokens to Sets of tags, which we can do with the mapValues method on Maps.

scala> val wordsToTags = tagged.groupBy(x => x._1).mapValues(listOfWordTagPairs => listOfWordTagPairs.map(wordTagPair => wordTagPair._2).toSet)
wordsToTags: scala.collection.immutable.Map[java.lang.String,scala.collection.immutable.Set[java.lang.String]] = Map(in -> Set(Prep), needed -> Set(Verb), . -> Set(Punc), cut -> Set(Verb), saw -> Set(Verb, Noun), a -> Set(Det), man -> Set(Noun, Verb), that -> Set(Pronoun), dark -> Set(Noun, Adjective), to -> Set(Prep), , -> Set(Punc), tall -> Set(Adjective), he -> Set(Pronoun), tree -> Set(Noun), the -> Set(Det))

The bit inside the mapValues(…) part will have some readers scrunching up their eyes, but you just need to look at the line where we got res28 above: if you understood that, then you just need to realize we are doing exactly the same thing, but now in the context of mapping over the values rather than dealing with a single value. Now you know how to map over values that you are mapping over.
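If the nested anonymous functions are hard to read, one option is to give the inner step a name. The sketch below does that with a hypothetical helper called tagsOf, and it rebuilds the Map with map and a case pattern instead of mapValues. In more recent Scala versions (2.13 onward), mapValues is deprecated in favor of .view.mapValues(...).toMap, so the explicit map form may travel better across versions.

```scala
val tagged = List(("the", "Det"), ("saw", "Verb"), ("the", "Det"), ("saw", "Noun"))

// Hypothetical helper: the function that was passed to mapValues, given a name.
def tagsOf(pairs: List[(String, String)]): Set[String] =
  pairs.map(pair => pair._2).toSet

// Matching on each (key, value) pair rebuilds the Map with transformed values.
val wordsToTags = tagged
  .groupBy(pair => pair._1)
  .map { case (word, pairs) => (word, tagsOf(pairs)) }
// wordsToTags == Map("the" -> Set("Det"), "saw" -> Set("Verb", "Noun"))
```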

Now that it is in hand, we can easily query the wordsToTags Map to see whether various words have various tags.

scala> wordsToTags("man")("Noun")
res8: Boolean = true

scala> wordsToTags("man")("Det")
res9: Boolean = false

scala> wordsToTags("man")("Verb")
res10: Boolean = true

scala> wordsToTags("saw")("Verb")
res11: Boolean = true

This is an example of how data structures within data structures (here Sets within a Map) are quite useful. (Exercise: think about what a tree is for a moment and how you might implement it using Lists.)

There are a variety of things you can do in computational linguistics with Maps from words to their parts-of-speech. A simple example is to compute the average number of tags per word type.

scala> val avgTagsPerType = wordsToTags.values.map(x=>x.size).sum/wordsToTags.size.toDouble
avgTagsPerType: Double = 1.2

If it isn’t clear to you what is going on here, tease it apart in your own REPL!
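As a starting point for that, here is one way to break the computation into named steps. This sketch uses a smaller, made-up wordsToTags Map so the arithmetic is easy to check by hand.

```scala
val wordsToTags = Map(
  "saw" -> Set("Verb", "Noun"),
  "the" -> Set("Det"),
  "man" -> Set("Noun", "Verb"),
  "a" -> Set("Det"))

val tagCounts = wordsToTags.values.toList.map(tags => tags.size)   // one count per word type
val totalTags = tagCounts.sum                                      // 2 + 1 + 2 + 1 = 6
val numTypes = wordsToTags.size                                    // 4

// toDouble forces floating-point division; dividing Int by Int would truncate.
val avgTagsPerType = totalTags / numTypes.toDouble                 // 1.5
val truncated = totalTags / numTypes                               // 1
```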

We can turn our word/tag pairs the other way to find out which words go with each part-of-speech. The only thing we need to do is groupBy on the second element of each pair, and then map the List values to their first element and get a Set from those.

scala> val tagsToWords = tagged.groupBy(x => x._2).mapValues(listOfWordTagPairs => listOfWordTagPairs.map(wordTagPair => wordTagPair._1).toSet)
tagsToWords: scala.collection.immutable.Map[java.lang.String,scala.collection.immutable.Set[java.lang.String]] = Map(Prep -> Set(in, to), Det -> Set(the, a), Noun -> Set(dark, man, saw, tree), Pronoun -> Set(that, he), Verb -> Set(saw, needed, man, cut), Punc -> Set(,, .), Adjective -> Set(tall, dark))

This basic paradigm is a powerful one for flipping between different data structures depending on what our needs are. It also demonstrates several important concepts with working with Lists, Maps and Sets. The next section shows a simple application of this idea for counting words in a text.

Counting words

A common task in computational linguistics is to calculate word statistics, and the most basic of those is to count the number of tokens of each word type in a particular text. The most common way to store and access those counts is in a Map, but how do you create such a Map from a given text? If we look at a text as a list of strings, then the groupBy paradigm we did above gives us exactly what we need — in fact it is even simpler than the word/tag manipulations done above.

The example text we’ll use is the tongue-twister about woodchucks.

scala> val woodchuck = "how much wood could a woodchuck chuck if a woodchuck could chuck wood ? as much wood as a woodchuck would , if a woodchuck could chuck wood ."
woodchuck: java.lang.String = how much wood could a woodchuck chuck if a woodchuck could chuck wood ? as much wood as a woodchuck would , if a woodchuck could chuck wood .

Given this, here’s how we can compute the number of occurrences of each word type. First we groupBy on the elements. Though a list of strings isn’t as interesting as having a list of Tuples as we had with words and tags, it still produces a useful result: we now have a unique set of keys corresponding to the types of elements found in the Array, and there is a corresponding value to each one that is the Array of tokens of that type.

scala> woodchuck.split(" ").groupBy(x=>x)
res29: scala.collection.immutable.Map[java.lang.String,Array[java.lang.String]] = Map(woodchuck -> Array(woodchuck, woodchuck, woodchuck, woodchuck), chuck -> Array(chuck, chuck, chuck), . -> Array(.), would -> Array(would), if -> Array(if, if), a -> Array(a, a, a, a), as -> Array(as, as), , -> Array(,), how -> Array(how), much -> Array(much, much), wood -> Array(wood, wood, wood, wood), ? -> Array(?), could -> Array(could, could, could))

And, we want to do something much simpler than what we did with the part-of-speech example: we just need to count the length of each Array, since each one contains every token of the corresponding word type. The function passed to mapValues is thus quite a bit simpler than the ones given in the previous section.

scala> val counts = woodchuck.split(" ").groupBy(x=>x).mapValues(x=>x.length)
counts: scala.collection.immutable.Map[java.lang.String,Int] = Map(woodchuck -> 4, chuck -> 3, . -> 1, would -> 1, if -> 2, a -> 4, as -> 2, , -> 1, how -> 1, much -> 2, wood -> 4, ? -> 1, could -> 3)

With counts, we can now access the frequencies of any of the words that were in the text.

scala> counts("woodchuck")
res5: Int = 4

scala> counts("could")
res6: Int = 3

Easy!  Of course, we normally want to build word counts for texts that are longer and are stored in a file rather than explicitly added to Scala code. The next tutorial will demonstrate how to do that.
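In the meantime, the counting recipe can be wrapped in a function so it can be reused on any string. This is only a sketch: wordCounts is a hypothetical helper name, it assumes tokens are separated by single spaces as in the example above, and it uses map with a case pattern in place of mapValues.

```scala
// Hypothetical helper: count the space-separated tokens in a string.
def wordCounts(text: String): Map[String, Int] =
  text.split(" ").toList
    .groupBy(w => w)
    .map { case (word, tokens) => (word, tokens.length) }

val counts = wordCounts("a woodchuck would chuck wood if a woodchuck could chuck wood")
val woodchuckCount = counts("woodchuck")           // 2

// A word that never occurred has no key, so fall back to a count of zero.
val beaverCount = counts.getOrElse("beaver", 0)    // 0
```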

Iterating over the keys and values in a Map

The material above shows some useful aspects of Maps, but of course there is much more you can do with them, often requiring iterating through the key-value pairs in the Map. We’ll use the counts Map created above for demonstrating this.

You can access just the keys, or just the values.

scala> counts.keys
res0: Iterable[java.lang.String] = Set(woodchuck, chuck, ., would, if, a, as, ,, how, much, wood, ?, could)

scala> counts.values
res1: Iterable[Int] = MapLike(4, 3, 1, 1, 2, 4, 2, 1, 1, 2, 4, 1, 3)

Notice that these are both Iterable data structures, so we can do all of the usual mapping, filtering, and so on, that we have already done with lists. (You may convert them to Lists if you like using toList, of course.)
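For instance, here is a short sketch of a few such operations on a cut-down version of the counts Map (the entries are copied from the result above; the val names are illustrative). The results are converted to Sets so that they do not depend on the order in which the Map happens to list its entries.

```scala
val counts = Map("woodchuck" -> 4, "chuck" -> 3, "would" -> 1, "wood" -> 4)

// Sum all the values to get the total number of tokens.
val totalTokens = counts.values.sum                              // 12

// Filter the keys like any other collection.
val longWords = counts.keys.filter(k => k.length > 4).toSet      // woodchuck, chuck, would

// A for expression with yield returns a collection of results.
val frequent = (for ((word, n) <- counts if n >= 4) yield word).toSet   // woodchuck, wood
```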

You can print out all of the key -> value pairs in the Map in a number of ways. One is to use a for expression.

scala> for ((k,v) <- counts) println(k + " -> " + v)
woodchuck -> 4
chuck -> 3
. -> 1
would -> 1
if -> 2
a -> 4
as -> 2
, -> 1
how -> 1
much -> 2
wood -> 4
? -> 1
could -> 3

And here are other ways to achieve the same result (output omitted since it is the same).

for (k <- counts.keys) println(k + " -> " + counts(k))
counts.map(kvPair => kvPair._1 + " -> " + kvPair._2).foreach(println)
counts.keys.map(k => k + " -> " + counts(k)).foreach(println)
counts.foreach { case(k,v) => println(k + " -> " + v) }
counts.foreach(kvPair => println(kvPair._1 + " -> " + kvPair._2))

And so on. Basically, you are able to step through the Map one key-value pair at a time, or you can grab the set of keys and then step through those and access the values from the map. Which form you use depends on what you need. For example, the foreach construct and a for expression without yield only perform an action and don’t return a useful value, whereas map expressions (and for expressions with yield) return a new collection. Why would you want a value back? Well, as an example, consider grouping all words that have occurred the same number of times.

scala> val countsToWords = counts.keys.toList.map(k => (counts(k),k)).groupBy(x=>x._1).mapValues(x=>x.map(y=>y._2))
countsToWords: scala.collection.immutable.Map[Int,List[java.lang.String]] = Map(3 -> List(chuck, could), 4 -> List(woodchuck, a, wood), 1 -> List(., would, ,, how, ?), 2 -> List(if, as, much))

We go from a Map to a Set of its keys to a List of those keys to a List of Tuples of the values and the keys to a Map from the values of the original Map to such Tuples, and then we map the values of the new map to just contain the words (the original keys). (That’s a mouthful, so try each step in the REPL to see what is going on in detail.)
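The same inversion can be written a little more directly by turning the Map itself into a List of (word, count) pairs and grouping on the count. The following is a sketch of that alternative on a reduced counts Map; it sorts each word list so the result does not depend on the Map's internal order.

```scala
val counts = Map("woodchuck" -> 4, "chuck" -> 3, "could" -> 3, "wood" -> 4, "how" -> 1)

// Group the (word, count) pairs by count, then keep only the words in each group.
val countsToWords = counts.toList
  .groupBy { case (word, n) => n }
  .map { case (n, pairs) => (n, pairs.map(p => p._1).sorted) }
// countsToWords == Map(4 -> List("wood", "woodchuck"), 3 -> List("chuck", "could"), 1 -> List("how"))
```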

Now we can output countsToWords sorted in descending numerical order by count, and then by alphabetical order by word within each count.

scala> countsToWords.keys.toList.sorted.reverse.foreach(x => println(x + ": " + countsToWords(x).sorted.mkString(",")))
4: a,wood,woodchuck
3: chuck,could
2: as,if,much
1: ,,.,?,how,would
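If you want one ranked list of individual words instead of one line per count, the standard sortBy method can sort the (word, count) pairs directly. This is a sketch on a reduced counts Map; negating the count is just a simple trick for getting descending order.

```scala
val counts = Map("woodchuck" -> 4, "chuck" -> 3, "could" -> 3, "a" -> 4, "how" -> 1)

// Sort by descending count, breaking ties alphabetically by word.
val ranked = counts.toList.sortBy { case (word, n) => (-n, word) }
// ranked == List(("a", 4), ("woodchuck", 4), ("chuck", 3), ("could", 3), ("how", 1))
```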

Options and flatMapping for dealing with missing keys

I pointed out toward the start of this tutorial that we run into trouble if we ask for a key that doesn’t exist in a Map. Let’s go back to the engToDeu Map we began with.

scala> val engToDeu = Map(("dog","Hund"), ("cat","Katze"), ("rhinoceros","Nashorn"))
engToDeu: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(dog -> Hund, cat -> Katze, rhinoceros -> Nashorn)

scala> engToDeu("dog")
res0: java.lang.String = Hund

scala> engToDeu("bird")
java.util.NoSuchElementException: key not found: bird

There is another way of accessing the elements of a Map, using the get method.

scala> engToDeu.get("dog")
res2: Option[java.lang.String] = Some(Hund)

scala> engToDeu.get("bird")
res3: Option[java.lang.String] = None

Now, the return value is an Option[String]. An Option is either a Some that contains a value or a None, which means there is no value. If you want to get the value out of a Some, you use the get method on Options.

scala> val dogTrans = engToDeu.get("dog")
dogTrans: Option[java.lang.String] = Some(Hund)

scala> dogTrans.get
res4: java.lang.String = Hund

If you just use get on a Map to obtain an Option and then immediately call get on the Option, you get the same behavior we had before.

scala> engToDeu.get("dog").get
res6: java.lang.String = Hund

scala> engToDeu.get("bird").get
java.util.NoSuchElementException: None.get

So, at this point, you are probably thinking that this sounds like a waste of time that is just making things more complex. Wait! It actually is tremendously useful because of pattern matching and the way many methods on sequences work.

First, here is how you can write a protected form of translating the words in a list without getting an exception.

scala> wordsToTranslate.foreach { x => engToDeu.get(x) match {
|   case Some(y) => println(x + " -> " + y)
|   case None =>
| }}
dog -> Hund
cat -> Katze

I know… this probably still isn’t convincing — it still looks more involved than the conditional we used (far) above to check whether engToDeu contained a given key (at least for this particular example). Hold on… because now we are just about ready for things to get simpler, and learn some useful things about Lists in doing so.

First, you should know about a great method on Lists called flatten. If you have a List of Lists of Strings, you can use flatten to get a single List of Strings. Consider the following example, in which we flatten a List of Lists of Strings and make a single String out of the result with mkString. Notice that the empty List in the third spot of the main List just disappears when we flatten it.

scala> val sentences = List(List("Here","is","sentence","one","."),List("The","third","sentence","is","empty","!"),List(),List("Lastly",",","we","have","a","final","sentence","."))
sentences: List[List[java.lang.String]] = List(List(Here, is, sentence, one, .), List(The, third, sentence, is, empty, !), List(), List(Lastly, ,, we, have, a, final, sentence, .))

scala> sentences.flatten
res0: List[java.lang.String] = List(Here, is, sentence, one, ., The, third, sentence, is, empty, !, Lastly, ,, we, have, a, final, sentence, .)

scala> sentences.flatten.mkString(" ")
res1: String = Here is sentence one . The third sentence is empty ! Lastly , we have a final sentence .

Flattening in general is pretty useful in its own right. Where it comes into play with Option values is that Options can be thought of as Lists: Somes are like one-element Lists and Nones are like empty Lists. So, when you have a List of Options, the flatten method gives you the value in each Some, and any Nones just drop away.

scala> wordsToTranslate.map(x => engToDeu.get(x))
res12: List[Option[java.lang.String]] = List(Some(Hund), None, Some(Katze), None)

scala> wordsToTranslate.map(x => engToDeu.get(x)).flatten
res13: List[java.lang.String] = List(Hund, Katze)

This is such a generally useful paradigm that there is a function flatMap which does exactly this.

scala> wordsToTranslate.flatMap(x => engToDeu.get(x))
res14: List[java.lang.String] = List(Hund, Katze)
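To check that the two forms really agree, and to see that flatMap is not specific to Options, here is a small sketch. The last line uses a function that returns a List for each element; the val names are illustrative.

```scala
val engToDeu = Map("dog" -> "Hund", "cat" -> "Katze")
val wordsToTranslate = List("dog", "bird", "cat", "armadillo")

// map followed by flatten, and flatMap, give the same result.
val twoSteps = wordsToTranslate.map(w => engToDeu.get(w)).flatten   // List("Hund", "Katze")
val oneStep = wordsToTranslate.flatMap(w => engToDeu.get(w))        // List("Hund", "Katze")

// flatMap also works when the function returns Lists; the empty string contributes nothing.
val chars = List("ab", "", "c").flatMap(s => s.toList)              // List('a', 'b', 'c')
```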

So, returning to the translation example above, we can now safely skip on by “schiffe” without fuss.

scala> example2.split(" ").flatMap(deuWord => miniDictionary.get(deuWord)).mkString(" ")
res15: String = from ice liberated are river and

Whether this is the desired behavior in this particular case is another question (e.g. you really should be doing some special unknown word handling). Nonetheless, you’ll find that flatMap is quite handy in general for this sort of pattern, in which a list of elements is used to retrieve values from a Map that will be missing some of those values.
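As one very simple illustration of unknown word handling (an assumption for the sake of example, not a recommended translation strategy), you could keep unknown words in the output and mark them, using getOrElse in place of flatMap. The helper name translateKeepingUnknowns and the angle-bracket marking are both made up here.

```scala
val miniDictionary = Map("befreit" -> "liberated", "eise" -> "ice", "sind" -> "are",
  "strom" -> "river", "und" -> "and", "vom" -> "from")

// Hypothetical helper: translate known words and mark unknown ones instead of dropping them.
def translateKeepingUnknowns(sentence: String): String =
  sentence.split(" ").map(w => miniDictionary.getOrElse(w, "<" + w + ">")).mkString(" ")

val translated = translateKeepingUnknowns("vom eise befreit sind strom und schiffe")
// translated is "from ice liberated are river and <schiffe>"
```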

An example of the further use of Options and flatMap is that you also may create functions that return Options and are thus amenable to flatMapping. Consider a function that squares only odd numbers and throws evens away (note: the % operator is the modulo operator that finds the remainder of division of one number by another — try it in the REPL).


scala> def squareOddNumber (x: Int) = if (x % 2 != 0) Some(x*x) else None
squareOddNumber: (x: Int)Option[Int]

If you map over the numbers 1 to 10, you’ll see the Somes and Nones, and if you flatMap it, you get exactly the desired result of the squares of all the odd numbers without any pollution from the evens.

scala> (1 to 10).toList.map(x=>squareOddNumber(x))
res16: List[Option[Int]] = List(Some(1), None, Some(9), None, Some(25), None, Some(49), None, Some(81), None)

scala> (1 to 10).toList.flatMap(x=>squareOddNumber(x))
res17: List[Int] = List(1, 9, 25, 49, 81)
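The same idea applies to the word/tag parsing used earlier in this post. The sketch below defines a hypothetical function parseTagged that returns None for tokens that are not of the form word/Tag, so that flatMap keeps only the well-formed pairs. The input tokens are invented for the example.

```scala
// Hypothetical helper: parse "word/Tag" into a pair, or None if the token is malformed.
def parseTagged(token: String): Option[(String, String)] = {
  val parts = token.split("/")
  if (parts.length == 2) Some((parts(0), parts(1))) else None
}

val tokens = List("the/Det", "saw", "man/Noun", "a/b/c")

// Tokens with no slash, or with too many, are silently skipped.
val pairs = tokens.flatMap(t => parseTagged(t))
// pairs == List(("the", "Det"), ("man", "Noun"))
```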

This turns out to be amazingly useful and common, so much so that the expression “just flatMap that shit” has become a common refrain among Scala programmers. Scala programmers even write scripts to remind them to do it. :)
