Student Questions about Scala, Part 2

Topics: toMap, accessing directory contents, calling R from Scala, Java/Scala comparisons and interop, supporting libraries, object-oriented + functional programming, NLP and Scala


This is the second post answering questions from students in my course on Applied Text Analysis. You can see the first one here. This post generally covers higher level questions, starting off with one basic question that didn’t make it into the first post.

Basic Question

Q. When I was working with Maps for the homework and tried to turn a List[List[Int]] into a map, I often got the error message that Scala “cannot prove that Int<:<(T,U)”. What does that mean?

A. So, you were trying to do the following.

scala> val foo = List(List(1,2),List(3,4))
foo: List[List[Int]] = List(List(1, 2), List(3, 4))

scala> foo.toMap
<console>:9: error: Cannot prove that List[Int] <:< (T, U).

This happens because toMap requires each element of the collection to be a pair (T, U), but here each element is a List[Int]. The problem is easier to see at the level of a single two-element list.

scala> List(1,2).toMap
<console>:8: error: Cannot prove that Int <:< (T, U).

So, you need to convert each two-element list to a tuple, and then you can call toMap on the list of tuples.

scala> foo.map{case List(a,b)=>(a,b)}.toMap
<console>:9: warning: match is not exhaustive!
missing combination            Nil

res3: scala.collection.immutable.Map[Int,Int] = Map(1 -> 2, 3 -> 4)

You can avoid the warning messages by flatMapping (which is safer anyway).

scala> foo.flatMap{case List(a,b)=>Some(a,b); case _ => None}.toMap
res4: scala.collection.immutable.Map[Int,Int] = Map(1 -> 2, 3 -> 4)

If you need to do this sort of thing a lot, you could use implicits to make the conversion from two-element Lists into Tuples, as discussed in the previous post about student questions.
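As a sketch of that idea (the class and method names here are mine, not from that post, and implicit classes require Scala 2.10 or later):

```scala
// Enrich two-element Lists with a toTuple2 method via an implicit class.
// (Illustrative names; one of several ways to set up the conversion.)
implicit class PairListOps[T](xs: List[T]) {
  def toTuple2: (T, T) = xs match {
    case List(a, b) => (a, b)
    case _ => sys.error("toTuple2 needs exactly two elements, got: " + xs)
  }
}

val foo = List(List(1, 2), List(3, 4))
val fooMap =
// fooMap: Map(1 -> 2, 3 -> 4)
```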

File system access

Q. How can I make a script or program pull in every file (or every file in a certain format) from a directory that is given as a command line argument and perform operations on it?

A. Easy. Let’s say you have a directory example_dir with the following files.

$ ls example_dir/
file1.txt      file2.txt      file3.txt      program1.scala program2.scala

I created these with some simple contents. Here’s a bash command that will print out each file and its contents so you can recreate them (and also see a handy command line for loop).

$ for i in `ls example_dir`; do echo "File: $i"; cat example_dir/$i; echo; done
File: file1.txt

File: file2.txt
Nice to meet you.

File: file3.txt

File: program1.scala

File: program2.scala



So, here’s how we can do the same using Scala. In the same directory that contains example_dir, save the following as ListDir.scala.

import File

val mydir = new File(args(0))
val allfiles = mydir.listFiles
val contents = { file => io.Source.fromFile(file).mkString } { case (file, content) =>
  println("File: " + file.getName)
  println(content)
}

You can now run it as scala ListDir.scala example_dir.

If you want to look at only files of a particular type, use filter on the list of files returned by mydir.listFiles. For example, the following selects just the Scala files.

val scalaFiles = mydir.listFiles.filter(_.getName.endsWith(".scala"))

As an exercise, now consider what you would need to do to recursively explore a directory that contains subdirectories and list the contents of all the files within it. Tip: you’ll need to use the isDirectory() method of
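Here is one way that exercise might be sketched, combining listFiles and isDirectory (just a sketch; there are many reasonable variations):

```scala

// Recursively collect every plain file under dir, descending into
// subdirectories. listFiles returns null for non-directories, hence Option.
def allFilesUnder(dir: File): Seq[File] = {
  val entries = Option(dir.listFiles).map(_.toSeq).getOrElse(Seq.empty)
  entries.flatMap { entry =>
    if (entry.isDirectory) allFilesUnder(entry) else Seq(entry)
  }
}
```

Calling allFilesUnder(new File("example_dir")) on the directory above would return all five files; you could then read and print each one as before.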

Q. Is it possible to run an R program within a Scala program? Like write a Scala program that performs R operations using R. If so, how? Are there directory requirements of some sort?

A. Though I haven’t used them myself, you could look at JRI (the Java-R Interface) or RCaller.

For some simple things, you can always take the strategy of saving some data to a file, calling an R program that processes that file and writes its output to one or more files, and then reading those files back into Scala. This strategy is useful more generally, e.g. for invoking arbitrary applications that compute and output values based on data created by your program.

Here’s an example of doing something like this. Save the following as something like CallR.scala, and then run scala CallR.scala. It assumes you have R installed.


import{BufferedWriter, FileWriter}
import scala.sys.process._

val data = List((4,1000), (3,1500), (2,1500), (2,6000), (1,14000), (0,18000))

val outputFilename = "vague.dat"
val bwriter = new BufferedWriter(new FileWriter(outputFilename))

val dataLine = {
  case (numAdjectives, price) => "c(" + numAdjectives + "," + price + ")"
}.mkString(",")

bwriter.write(
  """data = rbind(""" + dataLine + ")" + "\n" +
  """pdf("out.pdf")""" + "\n" +
  """plot(data)""" + "\n" +
  """data.lm = lm(data[,2] ~ data[,1])""" + "\n" +
  """abline(data.lm)""" + "\n" +
  """""" + "\n")

bwriter.flush
bwriter.close

val command = List("R", "-f", outputFilename)
command.!

It takes a set of points as a Scala List[(Int,Int)] and creates a set of R commands to plot the points, fit a linear regression model to the points, plot the regression line, and then output a PDF. I took the particular set of points used here from the example in Jurafsky and Martin in the chapter on maximum entropy (multinomial logistic regression), which is based on a study of how vague adjectives in a house listing affect its purchase price. For example, houses that had four vague adjectives in their listing sold for $1000 over their list price, while ones with one vague adjective sold for $14,000 over list price (read the book Freakonomics for some fascinating discussion of this).

Here’s the R code that is produced.

data = rbind(c(4,1000),c(3,1500),c(2,1500),c(2,6000),c(1,14000),c(0,18000))
plot(data)
data.lm = lm(data[,2] ~ data[,1])

The resulting plot in out.pdf shows the data points along with the fitted regression line.
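As a sanity check on what R computes here, the slope and intercept of that fit can be worked out directly in Scala with plain ordinary least squares for a single predictor (this computation is mine, not part of the original script):

```scala
val data = List((4, 1000), (3, 1500), (2, 1500), (2, 6000), (1, 14000), (0, 18000))
val (xs, ys) = data.unzip
val xMean = xs.sum.toDouble / xs.size
val yMean = ys.sum.toDouble / ys.size

// slope = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
val slope ={ case (x, y) => (x - xMean) * (y - yMean) }.sum /{ x => (x - xMean) * (x - xMean) }.sum
val intercept = yMean - slope * xMean
// For this data: slope = -4650.0, intercept = 16300.0, i.e. each additional
// vague adjective is associated with roughly $4650 less over list price.
```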

To recap, the basic logic of this process is the following.

  1. Have or create a set of points in Scala (in practice, the output of some computation that you now need R to complete).
  2. Use this data to programmatically create an R script from Scala code.
  3. Run the R script using scala.sys.process.

You could also have the R script output text information to a file which you could then read back into Scala and parse to get your results.

Note that this is not necessarily the most robust way to do this in general, but it does demonstrate a way to do things like calling system commands from within a Scala program.
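A minimal, self-contained example of that pattern, using echo instead of R so it runs anywhere with a shell:

```scala
import scala.sys.process._

// !! runs the command and returns its standard output as a String;
// ! would instead return the exit code (e.g. List("R", "-f", "vague.dat").!,
// assuming R is installed and on the PATH).
val output = Seq("echo", "hello from the shell").!!
println(output.trim)
```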

Another alternative is to look at frameworks like ScalaLab, which aims to support a Matlab-like environment for Scala. It’s on my stack of things to look at, and it would allow one to use Scala to directly do much of what one would want to call out to R and other such languages for.

High level questions

Q. Since Scala runs on the JVM, can we conclude that anything written in Scala can also be written in Java (perhaps with a loss of performance and lengthier code)?

A. For any two sufficiently expressive languages X and Y, one can write anything in X using Y and vice versa. So, yes. However, in terms of the ease of doing this, it is very easy to translate Java code to Scala, since the latter supports mutable, imperative programming of the kind usually done in Java. If you have Scala code that is functional in nature, it will be much harder to translate easily to Java (though it can of course be done).

Efficiency is a different question. Sometimes the functional style can be less efficient (especially if you are limiting yourself to a single machine), so at times it can be advantageous to use while loops and the like. However, for most cases, efficiency of programmer time matters more than efficiency of running time, so quickly putting together a solution using functional means and then optimizing it later — even at the “cost” of being less functional — is, in my mind, the right way to go. Josh Suereth has a nice blog post about this, Macro vs Micro Optimization, highlighting his experiences at Google.

Compared to Scala, the amount of code written will almost always be longer in Java, due both to the large amount of boilerplate code and to the higher-level nature of functional programming. I find that Scala programs (written in idiomatic, functional style) converted from Java are generally 1/4th to 1/3rd the number of characters of their Java counterparts. Going from Python to Scala also tends to produce less lengthy code, perhaps 3/4ths to 5/6ths or so in my experience. (Though this depends a great deal on what kind of Scala style you are using, functional or imperative or a mix).
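As one small, informal illustration of that difference, here is a word count, a one-liner in Scala, where the Java of the day needed a loop, a mutable map, and explicit containsKey bookkeeping:

```scala
// Scala: count word frequencies in a single expression.
val words = List("a", "rose", "is", "a", "rose")
val counts = words.groupBy(identity).map { case (w, ws) => (w, ws.size) }
// counts: Map(a -> 2, rose -> 2, is -> 1)

// Rough Java equivalent (pre-Java-8, for comparison):
//   Map<String, Integer> counts = new HashMap<String, Integer>();
//   for (String w : words) {
//     if (!counts.containsKey(w)) counts.put(w, 0);
//     counts.put(w, counts.get(w) + 1);
//   }
```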

Q. Scala seems to be relatively new — so, does it have supporting libraries for common tasks in NLP, like good JSON/XML parsers that you know of?

A. Sure. Basically anything that has been written for the JVM is quite straightforward to use with Scala. For natural language processing, we’ll be using the Apache OpenNLP library (which I and Gann Bierner began in 1999 while at the University of Edinburgh), but you can also use other toolkits like the Stanford NLP software, Mallet, Weka, and others. In fact, using Scala often makes it much easier to use these toolkits. There are also Scala specific toolkits that are beginning to appear, including Factorie, ScalaNLP, and Scalabha (which we are using in the class).

Scala has native XML support that I find pretty handy, though others wish it weren’t in the language. It is covered in most of the books on Scala, and Dan Spiewak has a nice blog post on it: Working with Scala’s XML Support.
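A small example of that support: XML literals are part of the language in the Scala of this era (later versions move this to the scala-xml module), and \ queries the tree:

```scala
// XML literals are parsed at compile time; \ selects child elements,
// and @ in the path selects attributes.
val catalog =
  <catalog>
    <book title="Hamlet"/>
    <book title="Ulysses"/>
  </catalog>

val titles = (catalog \ "book").map(b => (b \ "@title").text)
// titles: Seq("Hamlet", "Ulysses")
```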

The native JSON support isn’t great, but Java libraries for JSON work just fine.

Q. General question/comment: Scala lies in the region between object-oriented and functional programming language. My question is — Why? Is it because it makes coding a lot simpler and reduces the number of lines? In that case, I guess python achieves this goal reasonably well, and it has a rich library for processing strings. I am able to appreciate certain things, and ease of getting things done in Scala, but I am not exactly sure why this was even introduced, that too in a somewhat non-standard way (such a mixture of OOP and functional programming paradigm is the first that I have heard of).

A. I’ll defer to Martin Odersky, the creator of Scala. This is from his blog post “Why Scala?”:

Scala took a risk in that, before it came out, the object-oriented and functional approaches to programming were largely disjoint; even today the two communities are still sometimes antagonistic to each other. But what the team and I have learned in our daily programming practice since then has fully confirmed our initial hopes. Objects and functions play extremely well together; they enable new, expressive programming styles which lend themselves to high-level domain modeling and embedded domain-specific languages. Whether it’s log-analysis at NASA, contract modelling at EDF, or risk analysis at many of the largest financial institutions, Scala-based DSLs seem to spring up everywhere these days.

Here are some other interesting reads that touch on these questions:

Q. Do you see any distinct advantage of using Scala for NLP-related stuff? I know this is not a very specific question, but it would be great if you continue highlighting the difference between scala and other languages (like Java, Python) so that our understanding becomes clearer and clearer with more examples.

A. In many ways, such questions are a matter of personal taste. I used Python and Java before I switched to primarily using Scala. I liked Python for rapid prototyping, and Java for large-scale system development. I find Scala to be as good as, or better than, Python for prototyping, and every bit as good as, or better than, Java for large scale development. Now, I can use a single language — Scala — for most development. The exception is that I still use R for plotting data sets and for certain statistical analyses. The transition from Java to Scala was straightforward, and I went from writing Java-as-Scala to a more and more functional style as I got more comfortable with the language. The resulting code is far better designed, making it more robust, more extensible, and more fun to work with.

Specifically with respect to NLP, a definite advantage of Scala is that, as mentioned previously, it is really easy to use existing Java libraries (or any JVM library, for that matter). Another is that using a more functional style makes it easier to transition (in terms of both thinking and actual coding) to certain kinds of distributed computing architectures, such as MapReduce. As a really interesting example of Scala and distributed computing, check out Spark. With so much of text analytics being performed on massive datasets, this capability has become increasingly important. Another is that the actor-based computing model supported by the Akka library (which is closely tied to the core Scala libraries) holds many attractions for building language processing systems that need to deal with asynchronous information flows and data processing (FWIW, Akka can be used from Java, though it is far less enjoyable than from Scala). It is also quite handy for creating distributed versions of many classes of machine learning algorithms that can take better advantage of the structure of the solution than the one-size-fits-all MapReduce strategy can. For examples, you can check out the Akka version of Modified Adsorption and the Hadoop version of the same algorithm in the Junto toolkit.

At the end of the day, though, whether one language is “better” than another will depend on a given programmer’s preferences and abilities. For example, a great alternative to Scala is Clojure, which is dynamically typed, also JVM-based, and also functional — even more so than Scala. So, when evaluating this or that language, ask whether you can get more done more quickly and more maintainably. The outcome will be a function of the capabilities of the language and your skill as a programmer.

Q. In C++ a class is just a blueprint of an object and it has a size of 1 no matter how many members it has. Does the size of a Scala class depend on its members? Also, is there anything corresponding to “sizeof” operator in Scala?

A. I don’t know the answer to this. Any useful responses from readers would be welcome, and I’ll add them to this answer if and when they come in.

Copyright 2012 Jason Baldridge

The text of this post is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to and to this original post.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at or provide a comment to this post.

  1. Vincent M. said:

    Thanks a lot.

    I liked the part1 and I like this one also :)
    Very helpful.

  2. reddit said:

    After I open up your RSS feed it appears to be a lot of junk; is the problem on my part?

    • I’m not sure — I haven’t done anything wrt RSS for the blog. Sorry!

  3. Here are a few comments. The relevant question is repeated, followed by my comment on your answer.

    Q. When I was working with Maps for the homework and tried to turn a List[List[Int]] into a map, I often got the error message that Scala “cannot prove that Int<:<(T,U)”. What does that mean?

    The flatMap approach silently drops any inner list that doesn’t have exactly two elements:

    scala> val badFoo = List(List(1,2), List(9), List(3,4))
    badFoo: List[List[Int]] = List(List(1, 2), List(9), List(3, 4))

    scala> badFoo.flatMap{case List(a,b)=>Some(a,b); case _ => None}.toMap
    res37: scala.collection.immutable.Map[Int,Int] = Map(1 -> 2, 3 -> 4)

    We accidentally had an element in `foo` of length 1, but the `flatMap` silently dropped it! By this measure, the `case List(a,b)` approach is definitely the safest (which is why I use it and just roll my eyes at the warnings).

    An almost-as-good approach that gets rid of the warning is to use `Seq` instead of `List` since `Seq` is an open class (meaning that Scala won’t check for exhaustiveness).

    scala> val foo = List(List(1,2), List(3,4))
    foo: List[List[Int]] = List(List(1, 2), List(3, 4))

    scala> foo.map{case Seq(a,b)=>(a,b)}.toMap
    res38: scala.collection.immutable.Map[Int,Int] = Map(1 -> 2, 3 -> 4)

    As for my second point, if you _did_ want to silently drop non-matches (which does come up, though less frequently), you should use the `collect` method that is specifically designed for this purpose. `collect` is just a `map` that ignores unmatched items.

    scala> badFoo.collect { case Seq(a,b) => (a,b) }.toMap
    res40: scala.collection.immutable.Map[Int,Int] = Map(1 -> 2, 3 -> 4)

    In addition to being safer, this is also much shorter and simpler than the flatMap approach, which wraps and unwraps the item and also requires multiple cases.

    OR, use `toTuple2`, which I defined in my comment on part 1 of your post.

    res43: scala.collection.immutable.Map[Int,Int] = Map(1 -> 2, 3 -> 4)

    • Thanks Dan! The flatMap approach is great for lots of situations where you *know* you want to drop some items — perhaps that was not the best example for its use. :)

      • I would argue that flatMap only makes sense when you are _starting_ with a collection of Options or collections. If you’re just looking to filter arbitrary items (which is what you’re doing), then you should use `collect`, since that is its specific purpose.

        In other words, the following lines do the same thing, but the `collect` version is shorter, cleaner, and doesn’t wrap and unwrap the item unnecessarily.

        foo.flatMap { case List(a,b) => Some(a,b); case _ => None }.toMap
        foo.collect { case Seq(a,b) => (a,b) }.toMap

  4. Sure, but I’m not talking about collecting items, but instead about situations where you are actually mapping the input sequence items through a function, and that function sometimes encounters items that aren’t valid/relevant and you need to drop them (and it is okay to do so). You are absolutely right that collect is better for the uses discussed in the post. :)
