Basic XML processing with Scala

Topics: XML, Scala XML API, XML literals, marshalling

Introduction

Pretty much everybody knows what XML is: a structured, machine-readable text format for representing information, in which the “grammaticality” of the tags, attributes, and their relationships to each other can be easily checked (e.g. using DTDs). This contrasts with HTML, which can have elements that don’t close (e.g. <p>foo<p>bar rather than <p>foo</p><p>bar</p>) and still be processed. XML was only ever meant to be a format for machines, but it morphed into a data representation that many people ended up (unfortunately, for them) editing by hand. Even as a machine-readable format it has problems, such as being far more verbose than is really required, which matters quite a bit when you need to transfer lots of data from machine to machine. In the next post, I’ll discuss JSON and Avro, which can be viewed as evolutions of what XML was intended for and which work much better for many of the applications that matter in the “big data” context. Regardless, there is plenty of legacy data that was produced as XML, and there are many communities (e.g. the digital humanities community) that still seem to adore XML, so people doing any reasonable amount of text analysis work will likely find themselves eventually needing to work with XML-encoded data.

There are a lot of tutorials on XML and Scala — just do a web search for “Scala XML” and you’ll get them. As with other blog posts, this one is aimed at being very explicit so that beginners can see examples with all the steps in them, and I’ll use it to set up a JSON processing post.

A simple example of XML

To start things off, let’s consider a very basic example of creating and processing a bit of XML.

The first thing to know about XML in Scala is that Scala can process XML literals. That is, you don’t need to put quotes around XML strings; instead, you can just write them directly, and Scala will automatically interpret them as XML elements (of type scala.xml.Elem).

scala> val foo = <foo><bar type="greet">hi</bar><bar type="count">1</bar><bar type="color">yellow</bar></foo>
foo: scala.xml.Elem = <foo><bar type="greet">hi</bar><bar type="count">1</bar><bar type="color">yellow</bar></foo>

Now let’s do a little bit of processing on this. You can get all the text by using the text method.

scala> foo.text
res0: String = hi1yellow

So, that munged all the text together. To get them printed out with spaces between, let’s first get all the bar nodes and then get their texts and use mkString on that sequence. To get the bar nodes, we can use the \ selector.

scala> foo \ "bar"
res1: scala.xml.NodeSeq = NodeSeq(<bar type="greet">hi</bar>, <bar type="count">1</bar>, <bar type="color">yellow</bar>)

This gives us back a sequence of the bar nodes that occur directly under the foo node. Note that the \ operator (selector) is just a mirror image of the / selector used in XPath.

Of course, now that we have such a sequence, we can map over it to get what we want. Since the text method returns the text under a node, we can do the following.

scala> (foo \ "bar").map(_.text).mkString(" ")
res2: String = hi 1 yellow

To grab the value of the type attribute on each node, we can use the \ selector followed by “@type”.

scala> (foo \ "bar").map(_ \ "@type")
res3: scala.collection.immutable.Seq[scala.xml.NodeSeq] = List(greet, count, color)

We can also pair each attribute value with the text of its node.

scala> (foo \ "bar").map(barNode => (barNode \ "@type", barNode.text))
res4: scala.collection.immutable.Seq[(scala.xml.NodeSeq, String)] = List((greet,hi), (count,1), (color,yellow))

Note that the \ selector can only retrieve children of the node you are selecting from. To dig arbitrarily deep to pull out all nodes of a given type no matter where they are, use the \\ selector. Consider the following (bizarre) XML snippet with ‘z’ nodes at different levels of embedding.

<a>
  <z x="1"/>
  <b>
    <z x="2"/>
    <c>
      <z x="3"/>
    </c>
    <z x="4"/>
  </b>
</a>

Let’s first put it into the REPL.

scala> val baz = <a><z x="1"/><b><z x="2"/><c><z x="3"/></c><z x="4"/></b></a>
baz: scala.xml.Elem = <a><z x="1"></z><b><z x="2"></z><c><z x="3"></z></c><z x="4"></z></b></a>

If we want to get all of the ‘z’ nodes, we do the following.

scala> baz \\ "z"
res5: scala.xml.NodeSeq = NodeSeq(<z x="1"></z>, <z x="2"></z>, <z x="3"></z>, <z x="4"></z>)

And we can of course easily dig out the values of the x attributes on each of the z’s.

scala> (baz \\ "z").map(_ \ "@x")
res6: scala.collection.immutable.Seq[scala.xml.NodeSeq] = List(1, 2, 3, 4)
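
To make the difference between the two selectors concrete, here is a small sketch that runs both on the same document and converts the attribute values to Ints. The object name SelectorDemo is just something I made up for the example, and it assumes the scala-xml module is on your classpath (it has been a separate artifact since Scala 2.11).

```scala
import scala.xml.Elem

object SelectorDemo {
  // Same document as `baz` above: 'z' elements at three depths.
  val baz: Elem =
    <a><z x="1"/><b><z x="2"/><c><z x="3"/></c><z x="4"/></b></a>

  // \ looks only at direct children of <a>, so it finds a single z.
  val direct: Seq[String] = (baz \ "z").map(z => (z \ "@x").text)

  // \\ searches the whole subtree and returns matches in document order.
  // Attribute values come back as text, so convert them explicitly.
  val all: Seq[Int] = (baz \\ "z").map(z => (z \ "@x").text.toInt)

  def main(args: Array[String]): Unit = {
    println(direct)
    println(all.sum)
  }
}
```

Note that the attribute values are always text as far as the XML API is concerned, so any numeric interpretation is up to you.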

Throughout all of the above, we have used XML literals — that is, expressions typed directly into Scala, which interprets them as XML types. However, we usually need to process XML that is saved in a file, or a string, so the scala.xml.XML object has several methods for creating scala.xml.Elem objects from other sources. For example, the following allows us to create XML from a string.

scala> val fooString = """<foo><bar type="greet">hi</bar><bar type="count">1</bar><bar type="color">yellow</bar></foo>"""
fooString: java.lang.String = <foo><bar type="greet">hi</bar><bar type="count">1</bar><bar type="color">yellow</bar></foo>

scala> val fooElemFromString = scala.xml.XML.loadString(fooString)
fooElemFromString: scala.xml.Elem = <foo><bar type="greet">hi</bar><bar type="count">1</bar><bar type="color">yellow</bar></foo>

This Elem is the same as the one created using the XML literal, as shown by the following test.

scala> foo == fooElemFromString
res7: Boolean = true

See the Scala XML object for other ways to create XML elements, e.g. from InputStreams, Files, etc.
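
As a rough sketch of two of those other sources, the following loads the same XML from a java.io.Reader and from a java.io.InputStream using the XML.load overloads. The in-memory reader and stream are stand-ins I chose so that the example needs no file on disk; LoadDemo is a made-up name.

```scala
import java.io.{ByteArrayInputStream, StringReader}
import java.nio.charset.StandardCharsets
import scala.xml.{Elem, XML}

object LoadDemo {
  val source = """<foo><bar type="greet">hi</bar></foo>"""

  // From a java.io.Reader.
  val fromReader: Elem = XML.load(new StringReader(source))

  // From a java.io.InputStream (an in-memory stand-in for a file or socket).
  val fromStream: Elem =
    XML.load(new ByteArrayInputStream(source.getBytes(StandardCharsets.UTF_8)))

  def main(args: Array[String]): Unit = {
    // Both routes should give structurally equal elements.
    println(fromReader == fromStream)
    println((fromStream \ "bar").text)
  }
}
```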

A richer XML example

As a more interesting example of some XML to process, I’ve created the following short XML string describing artists, albums, and songs, which you can see in the GitHub gist music.xml.

https://gist.github.com/2597611

I haven’t put any special care into this, other than to make sure it has embedded tags, some of which have attributes, and some reasonably interesting content (and some great songs).

You should save this in a file called /tmp/music.xml. Once you’ve done that, you can run the following code, which just prints out each artist, album and song, with an indent for each level.

val musicElem = scala.xml.XML.loadFile("/tmp/music.xml")

(musicElem \ "artist").foreach { artist =>
  println((artist \ "@name").text + "\n")
  (artist \ "album").foreach { album =>
    println("  " + (album \ "@title").text + "\n")
    (album \ "song").foreach { song =>
      println("    " + (song \ "@title").text)
    }
    println()
  }
}
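
If you don’t have the gist handy, here is a self-contained sketch of the same traversal over a tiny stand-in document. The element and attribute names (music, artist, album, song, description, name, title, length) are the ones the code in this post selects on; the second artist and both descriptions are placeholders I invented, so this is not the real content of music.xml. It collects the indented lines into a sequence rather than printing them directly.

```scala
import scala.xml.Elem

object OutlineDemo {
  // Stand-in for /tmp/music.xml: same shape, mostly hypothetical content.
  val music: Elem =
    <music>
      <artist name="Radiohead">
        <album title="The King of Limbs">
          <song title="Bloom" length="5:15"/>
          <description>Placeholder description.</description>
        </album>
      </artist>
      <artist name="Example Artist">
        <album title="Example Album">
          <song title="Foo Bar" length="3:29"/>
          <description>Placeholder description.</description>
        </album>
      </artist>
    </music>

  // One line per artist, album and song, indented by nesting level.
  val outline: Seq[String] =
    (music \ "artist").flatMap { artist =>
      val albumLines = (artist \ "album").flatMap { album =>
        val songLines = (album \ "song").map(song => "    " + (song \ "@title").text)
        ("  " + (album \ "@title").text) +: songLines
      }
      (artist \ "@name").text +: albumLines
    }

  def main(args: Array[String]): Unit = outline.foreach(println)
}
```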

Converting objects to and from XML

One of the use cases for XML is to provide a machine-readable serialization format for objects that can still be easily read, and at times edited, by humans. The process of converting in-memory objects into a storage or transfer format like XML is called marshalling. We’ve started with some XML, so what we’ll do is define some classes and “unmarshal” the XML into objects of those classes. Put the following into the REPL. (Tip: You can use “:paste” to enter multi-line statements like those below. These will work without paste, but it is necessary to use it in some contexts, e.g. if you define Artist before Song.)

case class Song(title: String, length: String) {
  // Length in seconds, parsed from a "minutes:seconds" string such as "3:29".
  lazy val time = {
    val Array(minutes, seconds) = length.split(":")
    minutes.toInt*60 + seconds.toInt
  }
}

case class Album(title: String, songs: Seq[Song], description: String) {
  // Total running time in seconds.
  lazy val time = songs.map(_.time).sum
  // Formatted as minutes:seconds, with the seconds zero-padded.
  lazy val length = f"${time / 60}%d:${time % 60}%02d"
}

case class Artist(name: String, albums: Seq[Album])

Pretty simple and straightforward. Note the use of lazy vals for defining things like the time (length in seconds) of a song. The reason for this is that if we create a Song object but never ask for its time, then the code needed to compute it from a string like “4:38” is never run; however, if we had left lazy off, then it would be computed when the Song object is created. Also, we don’t want to use a def here (i.e. make time a method) because its value is fixed based on the length string; using a method would mean recomputing time every time it is asked for of a particular object.
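
Here is a minimal sketch of that difference. The counter parses and the classes LazySong and DefSong are hypothetical names I added purely to count how often the parsing code runs; they are not part of the classes defined above.

```scala
object LazyDemo {
  // Counts how often the parsing code runs; for illustration only.
  var parses = 0

  def toSeconds(length: String): Int = {
    parses += 1
    length.split(":") match {
      case Array(minutes, seconds) => minutes.toInt * 60 + seconds.toInt
    }
  }

  // lazy val: computed on first access, then cached.
  class LazySong(length: String) { lazy val time: Int = toSeconds(length) }

  // def: recomputed on every access.
  class DefSong(length: String) { def time: Int = toSeconds(length) }

  def main(args: Array[String]): Unit = {
    val lazySong = new LazySong("4:38")
    println(s"after construction: $parses parse(s)")
    lazySong.time; lazySong.time
    println(s"after two lazy accesses: $parses parse(s)")
    val defSong = new DefSong("4:38")
    defSong.time; defSong.time
    println(s"after two def accesses: $parses parse(s)")
  }
}
```

Constructing a LazySong does no parsing at all, the first access parses once, and later accesses reuse the cached value, whereas each access on a DefSong parses again.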

Given the classes above, we can create and use objects from them by hand.

scala> val foobar = Song("Foo Bar", "3:29")
foobar: Song = Song(Foo Bar,3:29)

scala> foobar.time
res0: Int = 209

Using the native Scala XML API

Of course, we’re more interested in constructing Artist, Album, and Song objects from information specified in files like the music example. Though I don’t show the REPL output here, you should enter all of the commands below into it to see what happens.

To start off, make sure you have loaded the file.

val musicElem = scala.xml.XML.loadFile("/tmp/music.xml")

Now we can work with the file to select various elements, or create objects of the classes defined above. Let’s start with just Songs. We can ignore all the artists and albums and dig straight in with the \\ operator.

val songs = (musicElem \\ "song").map { song =>
  Song((song \ "@title").text, (song \ "@length").text)
}

scala> songs.map(_.time).sum
res1: Int = 11311

And, we can go all the way and construct Artist, Album and Song objects that directly mirror the data stored in the XML file.

val artists = (musicElem \ "artist").map { artist =>
  val name = (artist \ "@name").text
  val albums = (artist \ "album").map { album =>
    val title = (album \ "@title").text
    val description = (album \ "description").text
    val songList = (album \ "song").map { song =>
      Song((song \ "@title").text, (song \ "@length").text)
    }
    Album(title, songList, description)
  }
  Artist(name, albums)
}

With the artists sequence in hand, we can do things like showing the length of each album.

val albumLengths = artists.flatMap { artist =>
  artist.albums.map(album => (artist.name, album.title, album.length))
}
albumLengths.foreach(println)

Which gives the following output.

(Radiohead,The King of Limbs,37:34)
(Radiohead,OK Computer,53:21)
(Portished,Dummy,48:46)
(Portished,Third,48:50)

Marshalling objects to XML

In addition to constructing objects from XML specifications (also referred to as deserializing and unmarshalling), it is often necessary to marshal objects one has constructed in code to XML (or other formats). The use of XML literals is actually quite handy in this regard. To see this, let’s start with the first song of the first album of the first artist (Bloom, by Radiohead).

scala> val bloom = artists(0).albums(0).songs(0)
bloom: Song = Song(Bloom,5:15)

We can construct an Elem from this as follows.

scala> val bloomXml = <song title={bloom.title} length={bloom.length}/>
bloomXml: scala.xml.Elem = <song length="5:15" title="Bloom"></song>

The thing to note here is that an XML literal is used, but when we want to use values from variables, we can escape from literal-mode with curly brackets. So, {bloom.title} becomes “Bloom”, and so on. In contrast, one could do it via a String as follows.

scala> val bloomXmlString = "<song title=\""+bloom.title+"\" length=\""+bloom.length+"\"/>"
bloomXmlString: java.lang.String = <song title="Bloom" length="5:15"/>

scala> val bloomXmlFromString = scala.xml.XML.loadString(bloomXmlString)
bloomXmlFromString: scala.xml.Elem = <song length="5:15" title="Bloom"></song>

So, the use of literals is a bit more readable (though it comes at the cost of making it hard in Scala to use “<” as an operator for many use cases, which is one of the reasons XML literals are considered by many to be not a great idea).
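
There is also a correctness angle, which this sketch tries to show: values spliced into an XML literal are escaped when the element is serialized, while plain string concatenation copies special characters into the markup as-is. The title "Rock & Roll" is a hypothetical value I picked because it contains an ampersand; EscapeDemo is a made-up name.

```scala
import scala.util.Try
import scala.xml.{Elem, XML}

object EscapeDemo {
  // Hypothetical title containing a character that is special in XML.
  val title = "Rock & Roll"

  // The literal keeps the raw value and escapes it on serialization.
  val viaLiteral: Elem = <song title={title} length="3:29"/>

  // Concatenation copies the bare '&' into the markup, which is not well-formed.
  val viaString: String = "<song title=\"" + title + "\" length=\"3:29\"/>"

  // Parsing the concatenated string fails, so wrap it in Try.
  val parsed: Try[Elem] = Try(XML.loadString(viaString))

  def main(args: Array[String]): Unit = {
    println(viaLiteral)
    println(parsed.isSuccess)
  }
}
```

With the String approach you would have to escape such values yourself before concatenating.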

We can create the whole XML for all of the artists and albums in one fell swoop. Note that one can have XML literals in the escaped bracketed portions of an XML literal, which allows the following to work. Note: you need to use the :paste mode in the REPL in order for this to work.

val marshalled =
  <music>
  { artists.map { artist =>
    <artist name={artist.name}>
    { artist.albums.map { album =>
      <album title={album.title}>
      { album.songs.map(song => <song title={song.title} length={song.length}/>) }
      <description>{album.description}</description>
      </album>
    }}
    </artist>
  }}
  </music>

Note that in this case, the for-yield syntax is perhaps a bit more readable since it doesn’t require the extra curly braces.

val marshalledYield =
<music>
  { for (artist <- artists) yield
    <artist name={artist.name}>
    { for (album <- artist.albums) yield
      <album title={album.title}>
      { for (song <- album.songs) yield <song title={song.title} length={song.length}/> }
      <description>{album.description}</description>
      </album>
    }
    </artist>
  }
</music>

One could of course instead add a toXml method to each of the Song, Album, and Artist classes such that at the top level you’d have something like the following.

val marshalledWithToXml = <music> { artists.map(_.toXml) } </music>
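
A minimal sketch of what those toXml methods might look like follows. The classes are simplified copies of the ones defined earlier (without the time and length helpers), wrapped in a made-up object ToXmlDemo so they don’t clash with the REPL definitions, and the album description is a placeholder rather than real data.

```scala
import scala.xml.Elem

object ToXmlDemo {
  case class Song(title: String, length: String) {
    def toXml: Elem = <song title={title} length={length}/>
  }

  case class Album(title: String, songs: Seq[Song], description: String) {
    // A sequence of Elems can be spliced directly into a literal.
    def toXml: Elem =
      <album title={title}>{songs.map(_.toXml)}<description>{description}</description></album>
  }

  case class Artist(name: String, albums: Seq[Album]) {
    def toXml: Elem = <artist name={name}>{albums.map(_.toXml)}</artist>
  }

  // Tiny hand-built data set; the description is a placeholder.
  val artists: Seq[Artist] = Seq(
    Artist("Radiohead", Seq(
      Album("The King of Limbs", Seq(Song("Bloom", "5:15")), "Placeholder description."))))

  val marshalledWithToXml: Elem = <music>{artists.map(_.toXml)}</music>

  def main(args: Array[String]): Unit = println(marshalledWithToXml)
}
```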

This is a fairly common strategy. However, note that the problem with this solution is that it produces a very tight coupling between the program logic (e.g. of what things like Songs, Albums and Artists can do) and other, orthogonal logic, like serializing them. To see a way of decoupling such different needs, check out Dan Rosen’s excellent tutorial on type classes.
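
To give a flavor of the idea, here is one possible shape for such a type class. It is only a sketch and not necessarily how that tutorial does it; XmlWriter, songWriter and the standalone toXml function are all names I invented for illustration.

```scala
import scala.xml.Elem

object TypeClassDemo {
  // The domain class knows nothing about XML.
  case class Song(title: String, length: String)

  // Hypothetical type class: how to write an A as XML, defined outside A.
  trait XmlWriter[A] {
    def write(a: A): Elem
  }

  object XmlWriter {
    // Instance for Song, found via the companion's implicit scope.
    implicit val songWriter: XmlWriter[Song] = new XmlWriter[Song] {
      def write(s: Song): Elem = <song title={s.title} length={s.length}/>
    }
  }

  // Works for any type that has an XmlWriter instance in scope.
  def toXml[A](a: A)(implicit writer: XmlWriter[A]): Elem = writer.write(a)

  val bloomXml: Elem = toXml(Song("Bloom", "5:15"))

  def main(args: Array[String]): Unit = println(bloomXml)
}
```

The serialization logic now lives in the XmlWriter instance, so Song itself stays free of any XML concerns and a different output format could be added without touching it.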

Conclusion

The standard Scala XML API comes packaged with Scala, and it is actually quite nice for some basic XML processing. However, it caused some “controversy”: many felt that the core language has no business providing specialized processing for a format like XML. There are also some efficiency issues. Anti-XML is a library that seeks to do a better job of processing XML (especially in being more scalable and more flexible in allowing programmatic editing of XML). As I understand things, Anti-XML may become a sort of official XML processing library in the future, with the current standard XML library being phased out. Nonetheless, many of the ways of interacting with an XML document shown above carry over, so being familiar with the standard Scala XML API provides the core concepts you’ll need for other such libraries.

Copyright 2012 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to http://www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.

9 comments
  2. Jason, is there a way to insert a new line between the elements ‘yield’-ed by ‘for’?
    I mean, to have { for (i <- 1 to 2) yield } produce something like this:

    rather than ?

    I’ve seen that you can play a bit with new-lines *inside* the yielded element, but I haven’t found a way to add one *between* the loops.

    Many thanks!

  3. Phil said:

    Great tutorial, many thanks, I’m just starting to learn Scala.

    I’m not sure I agree with your statements about XML – as far as I understand it (coming from an SGML background originally), XML was designed for modelling documents, not data, and it was meant for consumption by both humans and machines.

    Where it went wrong was people starting using it for data, which was overkill. Thankfully JSON came to the rescue there. But if you want to model documents, with mixed content, inline entities etc, then XML still makes sense.

    • All fair points! I guess I see it rarely used for actual documents and mainly for data…
