I’ve been reading papers about deep learning for several years now, but until recently hadn’t dug in and implemented any models using deep learning techniques for myself. To remedy this, I started experimenting with Deeplearning4J a few weeks ago, but with limited success. I read more books, primers and tutorials, especially the amazing series of blog posts by Chris Olah and Denny Britz. Then, with incredible timing for me, Google released TensorFlow to much general excitement. So, I figured I’d give it a go, especially given Delip Rao’s enthusiasm for it: he even compared moving from Theano to TensorFlow to changing from “a Honda Civic to a Ferrari.”

Here’s a quick prelude before getting to my initial simple explorations with TensorFlow. As most people (hopefully) know, deep learning encompasses ideas going back many decades (done under the names of connectionism and neural networks) that only became viable at scale in the past decade with the advent of faster machines and some algorithmic innovations. I was first introduced to them in a class taught by my PhD advisor, Mark Steedman, at the University of Pennsylvania in 1997. He was especially interested in how they could be applied to language understanding, which he wrote about in his 1999 paper “Connectionist Sentence Processing in Perspective.” I wish I had understood more about that topic (and many others) back then, but then again that’s the nature of being a young grad student. Anyway, Mark’s interest in connectionist language processing arose in part from being on the dissertation committee of James Henderson, who completed his thesis “Description Based Parsing in a Connectionist Network” in 1994. James was a post-doc in the Institute for Research in Cognitive Science at Penn when I arrived in 1996. As a young grad student, I had little idea of what connectionist parsing entailed, and my understanding from more senior (and far more knowledgeable) students was that James’ parsers were really interesting but that he had trouble getting the models to scale to larger data sets, at least compared to the data-driven parsers that others like Mike Collins and Adwait Ratnaparkhi were building at Penn in the mid-1990s. (Side note: for all the kids using logistic regression for NLP out there, you probably don’t know that Adwait was the one who first applied LR/MaxEnt to several NLP problems in his 1998 dissertation “Maximum Entropy Models for Natural Language Ambiguity Resolution”, in which he demonstrated how amazingly effective it was for everything from classification to part-of-speech tagging to parsing.)

Back to TensorFlow and the present day. I flew from Austin to Washington DC last week, and the morning before my flight I downloaded TensorFlow, made sure everything compiled, downloaded the necessary datasets, and opened up a bunch of tabs with TensorFlow tutorials. My goal was, while on the airplane, to run the tutorials, get a feel for the flow of TensorFlow, and then implement my own networks for doing some made-up classification problems. I came away from the exercise extremely pleased. This post explains what I did and gives pointers to the code to make it happen. My goal is to help out people who could use a bit more explicit instruction and guidance using a complete end-to-end example with easy to understand data. I won’t give lots of code examples in this post as there are several tutorials that already do that quite well—the value here is in the simple end-to-end implementations, the data to go with them, and a bit of explanation along the way.

As a preliminary, I recommend going to the excellent TensorFlow documentation, downloading it, and running the first example. If you can do that, you should be able to run the code I’ve provided to go along with this post in my try-tf repository on Github.

Simulated data

As a researcher who works primarily on empirical methods in natural language processing, my usual tendency is to try new software and ideas out on language data sets, e.g. text classification problems and the like. However, after hanging out with a statistician like James Scott for many years, I’ve come to appreciate the value of using simulated datasets early on to reduce the number of unknowns while getting the basics right. So, when sitting down with TensorFlow, I wanted to try three simulated data sets: linearly separable data, moon data and saturn data. The first is data that linear classifiers can handle easily, while the latter two require the introduction of non-linearities enabled by models like multi-layer neural networks. Here’s what they look like, with brief descriptions.

The linear data has two clusters that can be separated by a diagonal line from top left to bottom right:


Linear classifiers like perceptrons, logistic regression, linear discriminant analysis, support vector machines and others do well with this kind of data because learning these lines (hyperplanes) is exactly what they do.

The moon data has two clusters in crescent shapes that are tangled up such that no line can keep all the orange dots on one side without also including blue dots.


Note: see Implementing a Neural Network from Scratch in Python for a discussion of working with the moon data using Theano.

The saturn data has a core cluster representing one class and a ring cluster representing the other.

With the saturn data, a line is catastrophically bad. Perhaps the best one can do is draw a line that has all the orange points to one side. This ensures a small, entirely blue side, but it leaves the majority of blue dots in orange territory.

Example data has been generated in try-tf/simdata for each of these datasets, including a training set and test set for each. These are for the two dimensional cases visualized above, but you can use the scripts in that directory to generate data with other parameters, including more dimensions, greater variances, etc. See the commented out code for help to visualize the outputs, or adapt plot_data.R, which visualizes 2-d data in CSV format. See the README for instructions.
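For readers who want a feel for how such data can be produced, here is a minimal sketch of a saturn-style generator. It is not the script from try-tf/simdata: the function names, the radius ranges, and the use of uniform rather than Gaussian noise are all assumptions made for illustration. Each row is a label followed by two features, matching the "label plus features" layout described for the CSV files.

```python
import math
import random

def saturn_data(n, rng, core_max=3.0, ring_min=8.0, ring_max=12.0):
    """Return n rows of (label, x, y): label 0 is the core, label 1 is the ring.

    The radius ranges are illustrative assumptions, not the values used in try-tf.
    """
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        if label == 0:
            radius = rng.uniform(0.0, core_max)        # points near the origin
        else:
            radius = rng.uniform(ring_min, ring_max)   # points in a surrounding band
        rows.append((label, radius * math.cos(angle), radius * math.sin(angle)))
    return rows

def to_csv_line(row):
    """Format one row as 'label,x,y'."""
    return ",".join(str(v) for v in row)

if __name__ == "__main__":
    for row in saturn_data(5, random.Random(42)):
        print(to_csv_line(row))
```

Because the two classes differ only in their distance from the origin, no single line through the plane can separate them, which is what makes this a useful test case for non-linear models.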

Related: check out Delip Rao’s post on learning arbitrary lambda expressions.

Softmax regression

Let’s start with a network that can handle the linear data, the code for which is in the try-tf repository. The TensorFlow page has pretty good instructions for how to define a single layer network for MNIST, but no end-to-end code that defines the network, reads in data (consisting of label plus features), trains and evaluates the model. I found writing this to be a good way to familiarize myself with the TensorFlow Python API, so I recommend trying it yourself before looking at my code and then referring to it if you get stuck.

Let’s run it and see what we get.

$ python --train simdata/linear_data_train.csv --test simdata/linear_data_eval.csv
Accuracy: 0.99

This performs one pass (epoch) over the training data, so parameters were only updated once per example. 99% is good held-out accuracy, but allowing two training epochs gets us to 100%.

$ python --train simdata/linear_data_train.csv --test simdata/linear_data_eval.csv --num_epochs 2
Accuracy: 1.0

There’s a bit of code in the script to handle options and read in data. The most important lines are the ones that define the input data, the model, and the training step. I simply adapted these from the MNIST beginners tutorial, but the script puts it all together and provides a basis for transitioning to the network with a hidden layer discussed later in this post.

To see a little more, let’s turn on the verbose flag and run for 5 epochs.

$ python --train simdata/linear_data_train.csv --test simdata/linear_data_eval.csv --num_epochs 5 --verbose True

0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49

Weight matrix.
[[-1.87038445 1.87038457]
[-2.23716712 2.23716712]]

Bias vector.
[ 1.57296884 -1.57296848]

Applying model to first test instance.
Point = [[ 0.14756215 0.24351828]]
Wx+b = [[ 0.7521798 -0.75217938]]
softmax(Wx+b) = [[ 0.81822371 0.18177626]]

Accuracy: 1.0
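The last part of the verbose output can be checked by hand. The following sketch copies the weight matrix, bias vector and test point from the output above and recomputes Wx+b and the softmax in plain Python. The helper names are my own, and this is only an illustration of the arithmetic, not the TensorFlow code itself.

```python
import math

# Parameters and test point copied from the verbose output above.
W = [[-1.87038445, 1.87038457],
     [-2.23716712, 2.23716712]]
b = [1.57296884, -1.57296848]
point = [0.14756215, 0.24351828]

def affine(x, W, b):
    """Compute xW + b, where W has one row per input feature and one column per class."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def softmax(z):
    """Exponentiate and normalize, subtracting the max first for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = affine(point, W, b)
probs = softmax(logits)
print("Wx+b =", logits)
print("softmax(Wx+b) =", probs)
```

The printed values should agree with the verbose output to several decimal places; small differences in the last digits are expected because TensorFlow computed them in 32-bit floating point.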

Consider first the weights and bias. Intuitively, the classifier should find a separating hyperplane between the two classes, and it probably isn’t immediately obvious how W and b define that. For now, consider only the first column, with w1=-1.87038445, w2=-2.23716712 and b=1.57296884. Recall that w1 is the parameter for the `x` dimension and w2 is for the `y` dimension. The separating hyperplane satisfies Wx+b=0, from which we get the standard y=mx+b form.

Wx + b = 0
w1*x + w2*y + b = 0
w2*y = -w1*x - b
y = (-w1/w2)*x - b/w2

For the parameters learned above, we have the line:

y = -0.8360504*x + 0.7031074
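As a quick check of that arithmetic, this small sketch plugs the first column of the weight matrix and the first bias value from the verbose output into the rearranged formula. The function name is hypothetical.

```python
def boundary_line(w1, w2, b):
    """Rewrite w1*x + w2*y + b = 0 as y = slope*x + intercept (requires w2 != 0)."""
    return -w1 / w2, -b / w2

# First column of the weight matrix and first bias value from the verbose output.
slope, intercept = boundary_line(-1.87038445, -2.23716712, 1.57296884)
print("y = %.7f*x + %.7f" % (slope, intercept))
```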

Here’s the plot with the line, showing it is an excellent fit for the training data.


The second column of weights and bias separates the data points at the same place as the first, but mirrored 180 degrees from the first column. Strictly speaking, it is redundant to have two output nodes since a multinomial distribution with n outputs can be represented with n-1 parameters (see section 9.3 of Andrew Ng’s notes on supervised learning for details). Nonetheless, it’s convenient to define the network this way.
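The redundancy is easy to see numerically: for two classes, the softmax probability of the first class depends only on the difference between the two scores, so it equals a logistic sigmoid applied to that difference. The sketch below illustrates this with the two Wx+b values from the verbose output; the function names are my own.

```python
import math

def softmax_first(z1, z2):
    """Probability of the first class under a two-way softmax."""
    m = max(z1, z2)
    e1, e2 = math.exp(z1 - m), math.exp(z2 - m)
    return e1 / (e1 + e2)

def sigmoid(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))

# The two scores (Wx+b) for the first test instance in the verbose output.
z1, z2 = 0.7521798, -0.75217938

p_softmax = softmax_first(z1, z2)
p_sigmoid = sigmoid(z1 - z2)   # a single score difference carries the same information
print(p_softmax, p_sigmoid)
```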

Finally, let’s try the softmax network on the moon and saturn data.

$ python --train simdata/moon_data_train.csv --test simdata/moon_data_eval.csv --num_epochs 2
Accuracy: 0.856

$ python --train simdata/saturn_data_train.csv --test simdata/saturn_data_eval.csv --num_epochs 2
Accuracy: 0.45

As expected, it doesn’t work very well!

Network with a hidden layer

The program implements a network with a single hidden layer, and you can set the size of the hidden layer from the command line. Let’s try first with a two-node hidden layer on the moon data.

$ python --train simdata/moon_data_train.csv --test simdata/moon_data_eval.csv --num_epochs 100 --num_hidden 2
Accuracy: 0.88

So, that was an improvement over the softmax network. Let’s run it again, exactly the same way.

$ python --train simdata/moon_data_train.csv --test simdata/moon_data_eval.csv --num_epochs 100 --num_hidden 2
Accuracy: 0.967

Very different! What we are seeing is the effect of random initialization, which has a large effect on the learned parameters given the small, low-dimensional data we are dealing with here. (The network uses Xavier initialization for the weights.) Let’s try again but using three nodes.

$ python --train simdata/moon_data_train.csv --test simdata/moon_data_eval.csv --num_epochs 100 --num_hidden 3
Accuracy: 0.969

If you run this several times, the results don’t vary much and hover around 97%. The additional node increases the representational capacity and makes the network less sensitive to initial weight settings.

Adding more nodes doesn’t change results much—see the WildML post using the moon data for some nice visualizations of the boundaries being learned between the two classes for different hidden layer sizes.
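To make concrete what the hidden layer adds, here is a minimal sketch of the forward pass of a network with one tanh hidden layer followed by a softmax output. It is plain Python rather than TensorFlow, and the weights are made-up values for a 2-input, 3-hidden-node, 2-class network, not parameters learned from the moon data.

```python
import math

def dense(x, W, b):
    """Compute xW + b, where W has one row per input and one column per output."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def softmax(z):
    """Exponentiate and normalize, subtracting the max first for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, W1, b1, W2, b2):
    """Input -> tanh hidden layer -> softmax over classes."""
    hidden = [math.tanh(v) for v in dense(x, W1, b1)]
    return softmax(dense(hidden, W2, b2))

# Hypothetical parameters, chosen only to show the shapes involved.
W1 = [[1.0, -1.0, 0.5],
      [0.5, 1.0, -1.0]]      # 2 inputs x 3 hidden nodes
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, -1.0],
      [-1.0, 1.0],
      [0.5, -0.5]]           # 3 hidden nodes x 2 classes
b2 = [0.0, 0.0]

probs = forward([0.2, -0.4], W1, b1, W2, b2)
print(probs)
```

The tanh between the two linear maps is what lets the decision boundary bend; without it, the two layers would collapse into a single linear classifier.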

So, a hidden layer does the trick! Let’s see what happens with the saturn data.

$ python --train simdata/saturn_data_train.csv --test simdata/saturn_data_eval.csv --num_epochs 50 --num_hidden 2
Accuracy: 0.76

With just two hidden nodes, we already have a substantial boost from the 45% achieved by softmax regression. With 15 hidden nodes, we get 100% accuracy. There is considerable variation from run to run (due to random initialization). As with the moon data, there is less variation as nodes are added. Here’s a plot showing the increase in performance from 1 to 15 nodes, including ten accuracy measurements for each node count.


The line through the middle is the average accuracy measurement for each node count.

Initialization and activation functions are important

My first attempt at doing a network with a hidden layer was to merge what I had done in the softmax network with the MNIST network provided with the TensorFlow tutorials. This was a useful exercise to get a better feel for the TensorFlow Python API, and helped me understand the programming model much better. However, I found that I needed 25 or more hidden nodes in order to reliably get >96% accuracy on the moon data.

I then looked back at the WildML moon example and figured something was quite wrong since just three hidden nodes were sufficient there. The differences were that the MNIST example initializes its hidden layers with truncated normals instead of normals divided by the square root of the input size, initializes biases at 0.1 instead of 0 and uses ReLU activations instead of tanh. By switching to Xavier initialization (using Delip’s handy function), 0 biases, and tanh, everything worked as in the WildML example. I’m including my initial version in the repo so that others can see the difference and play around with it. (It turns out that what matters most is the initialization of the weights.)
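For reference, here is a rough sketch of two of the initialization schemes mentioned above, in plain Python. The first is the uniform variant of Xavier (Glorot) initialization, which samples from [-limit, limit] with limit = sqrt(6 / (n_in + n_out)); I am assuming this is the variant meant here, and it is not copied from Delip’s function. The second divides standard normals by the square root of the input size. Function names are hypothetical.

```python
import math
import random

def xavier_uniform(n_in, n_out, rng):
    """n_in x n_out matrix sampled uniformly from [-limit, limit],
    with limit = sqrt(6 / (n_in + n_out))."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]

def scaled_normal(n_in, n_out, rng):
    """n_in x n_out matrix of standard normals divided by sqrt(n_in)."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_out)]
            for _ in range(n_in)]

rng = random.Random(0)
W_hidden = xavier_uniform(2, 3, rng)   # 2 inputs, 3 hidden nodes
print(W_hidden)
```

Both schemes scale the initial weights by the layer sizes, which keeps the tanh units away from their saturated regions at the start of training.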

This is a simple example of what is often discussed with deep learning methods: they can work amazingly well, but they are very sensitive to initialization and choices about the sizes of layers, activation functions, and the influence of these choices on each other. They are a very powerful set of techniques, but they (still) require finesse and understanding, compared to, say, many linear modeling toolkits that can effectively be used as black boxes these days.


I walked away from this exercise very encouraged! I’ve been programming in Scala mostly for the last five years, so it required dusting off my Python (which I taught in my classes at UT Austin from 2005-2011, e.g. Computational Linguistics I and Natural Language Processing), but I found it quite straightforward. Since I work primarily with language processing tasks, I’m perfectly happy with Python since it’s a great language for munging language data into the inputs needed by packages like TensorFlow. Also, Python works well as a DSL for working with deep learning (it seems like there is a new Python deep learning package announced every week these days). It took me less than four hours to go through initial examples, and then build the softmax and hidden networks and apply them to the three data sets. (And a bunch of that time was me remembering how to do things in Python.)

I’m now looking forward to trying deep learning models, especially convnets and LSTMs, on language and image tasks. I’m also going to go back to my Scala code for trying out Deeplearning4J to see if I can get these simulation examples to run as I’ve shown here with TensorFlow. (I would welcome pull requests if someone else gets to that first!) As a person who works primarily on the JVM, it would be very handy to be able to work with DL4J as well.

After that, maybe I’ll write out the recurring rant going on in my head about deep learning not removing the need for feature engineering (as many backpropagandists seem to like to claim), but instead changing the nature of feature engineering, as well as providing a really cool set of new capabilities and tricks.

Bozhidar Bozhanov wrote a blog post titled “The Low Quality of Academic Code”, in which he observed that most academic software is poorly written. He makes plenty of fair points, e.g.:

… there are too many freshman mistakes – not considering thread-safety, cryptic, ugly and/or stringly-typed APIs, lack of type-safety, poorly named variables and methods, choosing bad/slow serialization formats, writing debug messages to System.err (or out), lack of documentation, lack of tests.

But, here’s the thing — I would argue that this lack of engineering quality in academic software is a feature, not a bug. For academics, there is basically little to no incentive to produce high quality software, and that is how it should be. Our currency is ideas and publications based on them, and those are obtained not by creating wonderful software, but by having great results. We have limited time, and that time is best put into thinking about interesting models and careful evaluation and analysis. The code is there to support that, and is fine as long as it is correct.

The truly important metric for me is whether the code supports replication of results from the paper it supports. The code can be as ugly as you can possibly imagine as long as it does this. Unfortunately, a lot of academic software doesn’t make replication easy. Nonetheless, having the code open sourced makes it at least possible to hack with it to try to replicate previous results. In the last few years, I’ve personally put a lot of effort into having my work and my students’ work easy to replicate. I’m particularly proud of how I put code, data and documentation together for a paper I did on topic model evaluation with James Scott for AISTATS in 2013, “A recursive estimate for the predictive likelihood in a topic model.” That was a lot of work, but I’ve already benefited from it myself (in terms of being able to get the data and run my own code). Check out the “code” links in some of my other papers for some other examples that my students have done for their research.

Having said the above, I think it is really interesting to see how people who have made their code easy to use (though not always well-engineered) have benefited from doing so in the academic realm. A good example is word2vec and how the software that was released for it generated tons of interest in industry as well as academia and probably led to much wider dissemination of that work, and to more follow on work. Academia itself doesn’t reward that directly, nor should it. That’s one reason you see it coming out of companies like Google, but it might be worth it to some researchers in some cases, especially PhD students who seek industry jobs after they defend their dissertation.

I read a blog post last year in which the author encouraged people to open source their code and not worry about how crappy it was. (I wish I could remember the link, so if you have it, please add it in a comment. Here is the post, “It’s okay for your open source library to be a bit shitty.”) I think this is a really important point. We should be careful to not get overly critical about code that people have made available to the world for free, not because we don’t want to damage their fragile egos, but because we want to make sure that people generally feel comfortable open sourcing. This is especially important for academic code, which is often the best recipe, no matter how flawed it might be, that future researchers can use to replicate results and produce new work that meaningfully builds on or compares to that work.

Update: Adam Lopez pointed out this very nice related article by John Regehr, “Producing good software from academia”.

Addendum: When I was a graduate student at the University of Edinburgh, I wrote a software package called OpenNLP Maxent (now part of the OpenNLP toolkit, which I also started then and which is still used widely today). While I was still a student, a couple of companies paid me to improve aspects of the code and documentation, which really helped me make ends meet at the time and made the code much better. I highly encourage this model — if there is an academic open source package that you think your company could benefit from, consider hiring the grad student author to make it better for the things that matter for your needs! (Or do it yourself and do a pull request, which is much easier today with Github than it was in 2000 with Sourceforge.)

Update: Thanks to the commenters below for providing the link to the post I couldn’t remember, “It’s okay for your open source library to be a bit shitty.” As a further note, the author surprisingly connects this topic to feminism in a cool way.

Topics: twitter,twitter4j,word clouds


My previous post showed how to use Twitter4j in Scala to access Twitter streams. This post shows how to control a Twitter user’s actions using Twitter4j. The primary purpose of this functionality is perhaps to create interfaces for Twitter like TweetDeck, but it can also be used to create bots that take automated actions on Twitter (one bot I’m playing around with is @tshrdlu, using the code in this tutorial and the code in the tshrdlu repository).

This post will only cover a small portion of the things you can do, but they are some of the more common things and I include a couple of simple but interesting use cases. Once you have these things in place, it is straightforward to figure out how to use the Twitter4j API docs (and Stack Overflow) to do the rest.

Getting set up: code and authorization

Rather than having the reader build the code up while going through the tutorial, I’ve set up the code in the repository twitter4j-tutorial. The version needed for this tutorial is v0.2.0. You can download a tarball of that version, which may be easier to work with if there have been further developments to the repository since the writing of this tutorial. Check out or download that code now. The main file of interest is:

  • src/main/scala/TwitterUser.scala

This tutorial is mainly a walk through for that file in blog form, with some additional pointers and explanations here and there.

You also need to set up the authorization details. See the “Setting up authorization” section of the previous post to do this if you haven’t already.


IMPORTANT: for this tutorial you must set the permissions for your application to be “Read and Write”. This does NOT mean to use ‘chmod’. It means going to the Twitter developers application site, signing in with your Twitter account, clicking on “Settings” and setting the permissions to read and write.


In the previous tutorial, authorization details were put into code. This time, we’ll use a properties file. This is easy: just add the file to the twitter4j-tutorial directory with the following contents, substituting your details as appropriate.

oauth.consumerKey=[your consumer key here]
oauth.consumerSecret=[your consumer secret here]
oauth.accessToken=[your access token here]
oauth.accessTokenSecret=[your access token secret here]

Rate limits and a note of caution

Unlike streaming access to Twitter, performing user actions via the API is subject to rate limits. Once you hit your limit, Twitter will throw an exception and refuse to comply with your requests until a period of time has passed (usually 15 minutes). Twitter does this to limit bad bots and also preserve their computational resources. For more information on rate limits, see Twitter’s page about rate limiting.

I’ll discuss how to manage rate limits later in the post, but I mention them up front in case you exceed them while messing around with things early on.

A word of caution is also in order: since you are going to be able to take actions automatically, like following users, posting a status, and retweeting, you could end up doing many of these actions in rapid succession. This (a) will use up your rate limit very quickly, (b) will probably not be interesting behavior, and (c) could get your account suspended. Make sure to follow the rules, especially those on following users.

If you are going to mess around quite a bit with actual posting, you may also want to consider creating an account that is not your primary Twitter account so that you don’t annoy your actual followers. (Suggestion: see the paragraph on “Create account” in part one of project phase one of my Applied NLP course for tips on how to add multiple accounts with the same gmail address.)

Basic interactions: searching, timelines, posting

All of the examples below are implemented as objects with main methods that do something using a twitter4j.Twitter object. To make it so we don’t have to call the TwitterFactory repeatedly, we first define a trait that gets a Twitter instance set up and ready to use.

import twitter4j._
trait TwitterInstance {
  val twitter = new TwitterFactory().getInstance
}

By extending this trait, our objects can access the twitter object conveniently.

As a first simple example, we can search for tweets that match a query by using the search method. The following object takes a query string given on the command line, searches for tweets using that query, and prints them.

object QuerySearch extends TwitterInstance {

  def main(args: Array[String]) {
    // Search for tweets matching the query given as the first argument.
    val statuses = twitter.search(new Query(args(0))).getTweets
    statuses.foreach(status => println(status.getText + "\n"))
  }

}

Note that this uses a Query object, whereas a FilterQuery was needed when using a TwitterStream. Also, for this to work, we must have the following import available:

import collection.JavaConversions._

This ensures that we can use the java.util.List returned by the getTweets method (of twitter4j.QueryResult) as if it were a Scala collection with the method foreach (and map, filter, etc). This is done via implicit conversions that make working with Java libraries far nicer than it would be otherwise.

To run this, go to the twitter4j-tutorial directory, and do the following (some example output shown):

$ ./build
> run-main bcomposes.twitter.QuerySearch scala
[info] Running bcomposes.twitter.QuerySearch scala
E' avvilente non sentirsi all'altezza di qualcosa o qualcuno, se non si possiede quella scala interiore sulla quale l'autostima può issarsi

Scala workshop will run with ECOOP, July 2nd in Montpellier, South of France. Call for papers is out.

#scala Even two of them in #cologne #germany . #thumbsup

RT @MILLIB2DAL: @djcameo Birthday bash 30th march @ Scala nightclub 100 artists including myself make sur u reach its gonna be #Legendary

@kot_2010 I think it's the same case with Scala: with macros it will tend to "outsource" things to macro libs, keeping a small lang core.

RT @waxzce: #scala hiring or job ? go there :

@esten That's not only a front-end problem. Scala devs should use scalaz.Equal and === for type safe equality. /cc @sharonw


[success] Total time: 1 s, completed Feb 26, 2013 1:54:44 PM

You might see some extra communications from SBT, which will probably need to download dependencies and compile the code. For the rest of the examples below, you can run them in a similar manner, substituting the right object name and providing any necessary arguments.

There are various timelines available for each user, including the home timeline, mentions timeline, and user timeline. They are accessible as twitter4j.api.TimelineResources. For example, the following object shows the most recent statuses on the authenticating user’s home timeline (which are the tweets by people the user follows).

object GetHomeTimeline extends TwitterInstance {

  def main(args: Array[String]) {
    // Show 10 statuses unless a count is given on the command line.
    val num = if (args.length == 1) args(0).toInt else 10
    val statuses = twitter.getHomeTimeline.take(num)
    statuses.foreach(status => println(status.getText + "\n"))
  }

}

The number of tweets to show is given as the command-line argument.

You can also update the status of the authenticating user from the command line using the following object. Calling it will post to the authenticating user’s account (so only do it if you are comfortable with the command-line argument you give it going onto your timeline).

object UpdateStatus extends TwitterInstance {
  def main(args: Array[String]) {
    // Posts the first command-line argument as a new status.
    twitter.updateStatus(new StatusUpdate(args(0)))
  }
}

There are plenty of other useful methods that you can use to interact with Twitter, and if you have successfully run the above three, you should be able to look at the Twitter4j javadocs and start using them. Some examples doing more interesting things are given below.

Replying to tweets written to you

The following object goes through the most recent tweets that have mentioned the authenticating user, and replies “OK.” to them. It includes the author of the original tweet and any other entities that were mentioned in it.

object ReplyOK extends TwitterInstance {

  def main(args: Array[String]) {
    val num = if (args.length == 1) args(0).toInt else 10
    val userName = twitter.getScreenName
    val statuses = twitter.getMentionsTimeline.take(num)
    statuses.foreach { status => {
      val statusAuthor = status.getUser.getScreenName
      // Screen names of everyone @-mentioned in the tweet.
      val mentionedEntities = status.getUserMentionEntities.map(_.getScreenName).toList
      // The author plus everyone mentioned, minus the authenticating user.
      val participants = (statusAuthor :: mentionedEntities).toSet - userName
      val text = participants.map(p => "@" + p).mkString(" ") + " OK."
      val reply = new StatusUpdate(text).inReplyToStatusId(status.getId)
      println("Replying: " + text)
      twitter.updateStatus(reply)
    }}
  }

}

This should be mostly self-explanatory, but there are a couple of things to note. First, you can find all the entities that have been mentioned (via @-mentions) in the tweet via the method getUserMentionEntities of the twitter4j.Status class. The code ensures that the author of the original tweet (who isn’t necessarily mentioned in it) is included as a participant for the response, and also we take out the authenticating user. So, if the message “@tshrdlu What do you think of @tshrdlc?” is sent from @jasonbaldridge, the response will be “@jasonbaldridge @tshrdlc OK.” Note how the screen names do not have the @ symbol, so that must be added in the tweet text of the reply.

Second, notice that StatusUpdate objects can be created by chaining methods that add more information to them, e.g. inReplyToStatusId and location, which incrementally build up the StatusUpdate object that actually gets posted. (This is a common Java strategy that basically helps get around the fact that parameters to classes can neither be specified by name in Java nor have defaults, the way they can in Scala.)

Checking and managing rate limit information

None of the above code makes many requests from Twitter, so there was little danger of exceeding rate limits. These limits are a mixture of both time and number of requests: you basically get a certain number of requests every hour (currently 350) per authenticating user. Because of these limits, you should consider accessing tweets, timelines, and such using the streaming methods when you can.

Every response you get from Twitter comes back as a sub-class of twitter4j.TwitterResponse, which not only gives you what you want (like a QueryResult) but also gives you information about your connection to Twitter. For rate limit information, you can use the getRateLimitStatus method, which can then inform you about the number of requests you can still make and the time until your limit resets.

The trait RateChecker below has a function checkAndWait that, when given a TwitterResponse object, checks whether the rate limit has been exceeded and waits if it has. When the rate is exceeded, it finds out how much time remains until the rate limit is reset and makes the thread sleep until that time (plus 10 seconds) has passed.

trait RateChecker {

  def checkAndWait(response: TwitterResponse, verbose: Boolean = false) {
    val rateLimitStatus = response.getRateLimitStatus
    if (verbose) println("RLS: " + rateLimitStatus)

    if (rateLimitStatus != null && rateLimitStatus.getRemaining == 0) {
      println("*** You hit your rate limit. ***")
      val waitTime = rateLimitStatus.getSecondsUntilReset + 10
      println("Waiting " + waitTime + " seconds ( " + waitTime/60.0 + " minutes) for rate limit reset.")
      // Sleep until the limit resets; waitTime is in seconds.
      Thread.sleep(waitTime * 1000L)
    }
  }

}

Using rate limits is actually more complex than this. For example, this strategy ignores the fact that different request types have different limits, but it keeps things simple. This is surely not an optimal solution, but it does the trick for present purposes.

Note also that you can directly ask for rate limit information from the twitter4j.Twitter instance itself, using the getRateLimitStatus method. Unlike the results for the same method on a TwitterResponse, this gives a Map from various request types to the current rate limit statuses for each one. In a real application, you’d want to control each of these different limits at a more fine-grained level using this information.

Not all of the methods of Twitter4j classes actually hit the Twitter API. To see whether a given method does, look at its Javadoc: if its description says “This method calls”, then it does hit the API. Otherwise, it doesn’t and you don’t need to guard it.

Examples using the checkAndWait function are given below.

Creating a word cloud from followers’ descriptions

Here’s a more interesting task: given a Twitter user, compute the counts of the words in the descriptions given in the bios of their followers and build a word cloud from them. The following code does this, outputting the resulting counts in a file, the contents of which can be pasted into Wordle’s advanced word cloud input.

import java.io.{BufferedWriter, FileWriter}

object DescribeFollowers extends TwitterInstance with RateChecker {

  def main(args: Array[String]) {
    val screenName = args(0)
    val maxUsers = if (args.length==2) args(1).toInt else 500
    val followerIds = twitter.getFollowersIDs(screenName,-1).getIDs

    // Look up each follower, pausing whenever the rate limit is used up.
    val descriptions = followerIds.take(maxUsers).flatMap { id => {
      val user = twitter.showUser(id)
      checkAndWait(user)
      if (user.isProtected) None else Some(user.getDescription)
    }}

    // Keep lower-cased tokens longer than 3 characters made of letters, # or @.
    val tword = """(?i)[a-z#@]+""".r.pattern
    val words = descriptions.flatMap(_.toLowerCase.split("\\s+"))
    val filtered = words.filter(_.length > 3).filter(tword.matcher(_).matches)
    val counts = filtered.groupBy(x=>x).mapValues(_.length)
    val rankedCounts = counts.toSeq.sortBy(- _._2)

    // One word:count pair per line (assumed here to be the format that
    // Wordle's advanced input accepts).
    val wordcountFile = "/tmp/follower_wordcount.txt"
    val writer = new BufferedWriter(new FileWriter(wordcountFile))
    for ((w,c) <- rankedCounts)
      writer.write(w+":"+c+"\n")
    writer.flush
    writer.close
  }

}


The thing to consider is that if you are pointing this at a person with several hundred followers, you will exceed the rate limit. The call to getFollowersIDs is a single hit, and then each call to showUser is a hit. Because the showUser calls come in rapid succession, we check the rate limit status after each one using checkAndWait (which is available because we mixed in the RateChecker trait) and it waits for the limit to reset as previously discussed, keeping us from exceeding the rate limit and getting an exception from Twitter.

The number of users returned by getFollowersIDs is at most 5000. If you run this on a user who has more followers, followers beyond 5000 won’t be considered. If you want to tackle such a user, you’ll need to use the cursor, which is the integer provided as the argument to getFollowersIDs: start with -1, then make further calls, each time passing the next-cursor value that came back with the previous result (getNextCursor), until there are no more pages.

Most of the rest of the code is just standard Scala stuff for getting the word counts and outputting them to a file. Note that a small effort is made to drop tokens containing non-alphabetic characters (while allowing # and @) and to filter out short words.

As an example of the output, when put into Wordle, here is the word cloud for my followers.


This looks about right for me—completely expected in fact—but it is still cool that it comes out of my followers’ self descriptions. One could start thinking of some fun algorithms for exploiting this kind of representation of a user to look into how well different users align or don’t align with their followers, or to look for clusters of different types of followers, etc.

Retweeting automatically

Tired of actually reading those tweets in your timeline and retweeting some of them? The following code gets some of the accounts the authenticating user follows, grabs twenty of those users, filters them to get interesting ones, and then takes up to 10 of the remaining ones and retweets their most recent statuses (provided they aren’t replies to someone else).

object RetweetFriends extends TwitterInstance with RateChecker {

  def main(args: Array[String]) {
    val friendIds = twitter.getFriendsIDs(-1).getIDs
    val friends = friendIds.take(20).map { id => {
      val user = twitter.showUser(id)
      checkAndWait(user)
      user
    }}

    // Drop less interesting accounts, then rank the rest by follower count.
    val filtered = friends.filter(admissable)
    val ranked = filtered.map(f => (f.getFollowersCount, f)).sortBy(- _._1).map(_._2)

    ranked.take(10).foreach { friend => {
      val status = friend.getStatus
      if (status!=null && status.getInReplyToStatusId == -1) {
        println("\nRetweeting " + friend.getName + ":\n" + status.getText)
        twitter.retweetStatus(status.getId)
        Thread.sleep(30000) // wait 30 seconds between retweets
      }
    }}
  }

  def admissable(user: User) = {
    val ratio = user.getFollowersCount.toDouble/user.getFriendsCount
    user.getFriendsCount < 1000 && ratio > 0.5
  }

}


The getFriendsIDs method is used to get the users that the authenticating user is following (but who do not necessarily follow the authenticating user, despite the use of the word “friend”). We again take care with the rate limiting on gathering the users. We filter these users, looking for those who follow fewer than 1000 users and those who have a follower/friend ratio of greater than .5, in a simple attempt to filter out some less interesting (or spammy) accounts. The remaining users are then ranked according to their number of followers (most first). Finally, we take (up to) 10 of these (the take method returns 3 things if you ask for 10 but there are just 3), look at their most recent status, and if it is not null and isn’t a reply to someone, we retweet it. Between each of these, we wait for 30 seconds so that anyone following our account doesn’t get an avalanche of retweets.


This post and the related code should provide enough to get a decent feel for working with Twitter4j, including necessary setup and using some of the methods to start creating applications with it in Scala. See project phase three of my Applied NLP course to see exercises and code that take this further to do interesting things for automated bots, including mixing streaming access and user access to get more complex behaviors.

Topics: twitter, twitter4j, sbt


My previous post provided a walk-through for using the Twitter streaming API from the command line, but tweets can be more flexibly obtained and processed using an API for accessing Twitter using your programming language of choice. In this tutorial, I walk through basic setup and some simple uses of the twitter4j library with Scala. Much of what I show here should be useful for those using other JVM languages like Clojure and Java. If you haven’t gone through the previous tutorial, have a look now before going on as this tutorial covers much of the same material but using twitter4j rather than HTTP requests.

I’ll introduce code, bit by bit, for accessing the Twitter data in different ways. If you get lost with what should go where, all of the code necessary to run the commands is available in this github gist, so you can compare to that as you move through the tutorial.

Update: The tutorial is set up to take you from nothing to being able to obtain tweets in various ways, but you can also get all the relevant code by looking at the twitter4j-tutorial repository. For this tutorial, the tag is v0.1.0, and you can also download a tarball of that version.

Getting set up

An easy way to use the twitter4j library in the context of a tutorial like this is for the reader to set up a new SBT project, declare it as a dependency, and then compile and run code within SBT. (See my tutorial on using Jerkson for processing JSON with Scala for another example of this.) This sorts out the process of obtaining external libraries and setting up the classpath so that they are available. Follow the instructions in this section to do so.

$ mkdir ~/twitter4j-tutorial
$ cd ~/twitter4j-tutorial/
$ wget

Now, save the following as the file ~/twitter4j-tutorial/build.sbt. Be aware that it is important to keep the empty lines between each of the declarations.

name := "twitter4j-tutorial"

version := "0.1.0"

scalaVersion := "2.10.0"

libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "3.0.3"

Then save the following as the file ~/twitter4j-tutorial/build.

java -Xms512M -Xmx1536M -Xss1M -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=384M -jar `dirname $0`/sbt-launch.jar "$@"

Make that file executable and run it, which will show SBT doing a bunch of work and then leave you with the SBT prompt. At the SBT prompt, invoke the update command.

$ cd ~/twitter4j-tutorial
$ chmod a+x build
$ ./build
[info] Set current project to twitter4j-tutorial (in build file:/Users/jbaldrid/twitter4j-tutorial/)
> update
[info] Updating {file:/Users/jbaldrid/twitter4j-tutorial/}default-570731...
[info] Resolving org.twitter4j#twitter4j-core;3.0.3 ...
[info] Done updating.
[success] Total time: 1 s, completed Feb 8, 2013 12:55:41 PM

To test whether you have access to twitter4j now, go to the SBT console and import the classes from the main twitter4j package.

> console
[info] Starting scala interpreter...
Welcome to Scala version 2.10.0 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_37).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import twitter4j._
import twitter4j._

If nothing further is output, then you are all set (exit the console using CTRL-D). If things are amiss (or if you are running in the default Scala REPL), you’ll instead see something like the following.

scala> import twitter4j._
<console>:7: error: not found: value twitter4j
import twitter4j._

If this is what you got, try to follow the instructions above again to make sure that your setup is exactly as above (check the versions, etc).

If you just want to see some examples of using twitter4j as an API and are happy adding its jars by hand to your classpath or are using an IDE like Eclipse, then it is unnecessary to do the SBT setup — just read on and adapt the examples as necessary.

Write, compile and run a simple main method

To set the stage for how we’ll run programs in this tutorial, let’s create a simple main method and ensure it can be run in SBT. Do the following:

$ mkdir -p ~/twitter4j-tutorial/src/main/scala/

Next, save the following code as ~/twitter4j-tutorial/src/main/scala/TwitterStream.scala.

package bcomposes.twitter

import twitter4j._

object StatusStreamer {
  def main(args: Array[String]) {
  }
}

Next, at the SBT prompt for the twitter4j-tutorial project, use the run-main command as follows.

> run-main bcomposes.twitter.StatusStreamer
[info] Compiling 1 Scala source to /Users/jbaldrid/twitter4j-tutorial/target/scala-2.10/classes...
[info] Running bcomposes.twitter.StatusStreamer
[success] Total time: 2 s, completed Feb 8, 2013 1:36:32 PM

SBT compiles the code, and then runs it. This is a generally handy way of running code with all the dependencies available without having to worry about explicitly handling the classpath.

In what comes below, we’ll flesh out that main method so that it does more interesting work.

Setting up authorization

When using the Twitter streaming API to access tweets via HTTP requests, you must supply your Twitter username and password. To use twitter4j, you also must provide authentication details; however, for this you need to set up OAuth authentication. This is straightforward:

  1. Go to and click on the button that says “Create a new application” (of course, you’ll need to log in with your Twitter username and password in order to do this)
  2. Fill in the name, description and website fields. Don’t worry too much about this: put in whatever you like for the name and description (e.g. “My example application” and “Tutorial app for me”). For the website, give the URL of your Twitter account if you don’t have anything better to use.
  3. A new screen will come up for your application. Click on the button at the bottom that says “Create my access token”.
  4. Click on the “OAuth tool” tab and you’ll see four fields for authentication which you need in order to use twitter4j to access tweets and other information from Twitter: Consumer key, Consumer secret, Access token, and Access token secret.

Based on these authorization details, you now need to create a twitter4j.conf.Configuration object that will allow twitter4j to access the Twitter API on your behalf. This can be done in a number of different ways, including environment variables, properties files, and in code. To keep it as simple as possible for this tutorial, we’ll go with the last option.

Add the following object after the definition of StatusStreamer, providing your details rather than the descriptions given below.

object Util {
  // Builds the Configuration that the twitter4j factories expect.
  val config = new twitter4j.conf.ConfigurationBuilder()
    .setOAuthConsumerKey("[your consumer key here]")
    .setOAuthConsumerSecret("[your consumer secret here]")
    .setOAuthAccessToken("[your access token here]")
    .setOAuthAccessTokenSecret("[your access token secret here]")
    .build
}

You should of course be careful not to let your details be known to others, so make sure that this code stays on your machine. When you start developing for real, you’ll use other means to get the authorization information injected into your application.

Pulling tweets from the sample stream

In the previous tutorial, the most basic sort of access was to get a random sample of tweets from, so let’s use twitter4j to do the same.

To do this, we are going to create a TwitterStream instance that gives us an authorized connection to the Twitter API. To see all the methods associated with the TwitterStream class, see the API documentation for TwitterStream.  A TwitterStream instance is able to get tweets (and other information) and then provide them to any listeners that have registered with it. So, in order to do something useful with the tweets, you need to implement the StatusListener interface and connect it to the TwitterStream.

Before showing the code for creating and using the stream, let’s create a StatusListener that will perform a simple action based on tweets streaming in. Add the following code to the Util object created earlier.

// Prints the text of each incoming status and ignores all other events.
def simpleStatusListener = new StatusListener() {
  def onStatus(status: Status) { println(status.getText) }
  def onDeletionNotice(statusDeletionNotice: StatusDeletionNotice) {}
  def onTrackLimitationNotice(numberOfLimitedStatuses: Int) {}
  def onException(ex: Exception) { ex.printStackTrace }
  def onScrubGeo(arg0: Long, arg1: Long) {}
  def onStallWarning(warning: StallWarning) {}
}

This method creates objects that implement StatusListener (though it only does something useful for the onStatus method and otherwise ignores all other events sent to it). Clearly, what it is going to do is take a Twitter status (which is all of the information associated with a tweet, including author, retweets, geographic coordinates, etc) and output the text of the status—i.e., what we usually think of as a “tweet”.

The following code puts it all together. We create a TwitterStream object by using the TwitterStreamFactory and the configuration, add a simpleStatusListener to the stream, and then call the sample method of TwitterStream to start receiving tweets. If that were the last line of the program, it would just keep receiving tweets until the process was killed. Here, I’ve added a 2 second sleep so that we can see some tweets, then clean up the connection and shut it down cleanly. (We could let it run indefinitely, but then to kill the process, we would need to use CTRL-C, which will kill not only that process, but also the process that is running SBT.)

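object StatusStreamer {
  def main(args: Array[String]) {
    val twitterStream = new TwitterStreamFactory(Util.config).getInstance
    twitterStream.addListener(Util.simpleStatusListener)
    twitterStream.sample
    Thread.sleep(2000)
    twitterStream.cleanUp
    twitterStream.shutdown
  }
}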

To run this code, simply put in the same run-main command in SBT as before.

> run-main bcomposes.twitter.StatusStreamer

You should see tweets stream by for a couple of seconds and then you’ll be returned to the SBT prompt.

Pulling tweets with specific properties

As with the HTTP streaming, it’s easy to use twitter4j to follow a particular set of users, particular search terms, or tweets produced within certain geographic regions. All that is required is creating appropriate FilterQuery objects and then using the filter method of TwitterStream rather than the sample method.

FilterQuery has several constructors, one of which allows an Array of Long values to be provided, each of which is the id of a Twitter user who is to be followed by the stream. (See the previous tutorial to see one easy way to get the id of a user based on their username.)

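object FollowIdsStreamer {
  def main(args: Array[String]) {
    val twitterStream = new TwitterStreamFactory(Util.config).getInstance
    twitterStream.addListener(Util.simpleStatusListener)
    twitterStream.filter(new FilterQuery(Array(1344951,5988062,807095,3108351)))
    Thread.sleep(10000)
    twitterStream.cleanUp
    twitterStream.shutdown
  }
}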

These are the IDs for Wired Magazine (@wired), The Economist (@theeconomist), the New York Times (@nytimes), and the Wall Street Journal (@wsj). Add the code to TwitterStream.scala and then run it in SBT. Note that I’ve made the program sleep for 10 seconds in order to give more time for tweets to arrive (since these are just four accounts and will have varying activity). If you are not seeing anything show up, increase the sleep time.

> run-main bcomposes.twitter.FollowIdsStreamer

To track tweets that contain particular terms, create a FilterQuery with the default constructor and then call the track method with an Array of strings that contains the query terms you are interested in. The object below does this, and uses the args Array as the container for the query terms.

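object SearchStreamer {
  def main(args: Array[String]) {
    val twitterStream = new TwitterStreamFactory(Util.config).getInstance
    twitterStream.addListener(Util.simpleStatusListener)
    twitterStream.filter(new FilterQuery().track(args))
    Thread.sleep(10000) // assumed sleep time; adjust to taste
    twitterStream.cleanUp
    twitterStream.shutdown
  }
}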

With things set up this way, you can track arbitrary queries by specifying them on the command line.

> run-main bcomposes.twitter.SearchStreamer scala
> run-main bcomposes.twitter.SearchStreamer scala python java
> run-main bcomposes.twitter.SearchStreamer "sentiment analysis" "machine learning" "text analytics"

If the search terms are not particularly common, you’ll need to increase the sleep time.

To filter by location, again create a FilterQuery with the default constructor, but then use the locations method, with an Array[Array[Double]] argument: basically an Array of two-element Arrays, each of which contains the longitude and latitude (in that order) of a corner of a bounding box, with the southwest corner given before the northeast one. Here’s an example that creates a bounding box for Austin and uses it.

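object AustinStreamer {
  def main(args: Array[String]) {
    val twitterStream = new TwitterStreamFactory(Util.config).getInstance
    twitterStream.addListener(Util.simpleStatusListener)
    val austinBox = Array(Array(-97.8,30.25),Array(-97.65,30.35))
    twitterStream.filter(new FilterQuery().locations(austinBox))
    Thread.sleep(10000) // assumed sleep time; adjust to taste
    twitterStream.cleanUp
    twitterStream.shutdown
  }
}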

To make things more flexible, we can take the bounding box information on the command line, convert the Strings into Doubles and pair them up.

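object LocationStreamer {
  def main(args: Array[String]) {
    // Convert the arguments to Doubles and pair them up as (longitude, latitude).
    val boundingBoxes = args.map(_.toDouble).grouped(2).toArray
    val twitterStream = new TwitterStreamFactory(Util.config).getInstance
    twitterStream.addListener(Util.simpleStatusListener)
    twitterStream.filter(new FilterQuery().locations(boundingBoxes))
    Thread.sleep(10000) // assumed sleep time; adjust to taste
    twitterStream.cleanUp
    twitterStream.shutdown
  }
}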

We can call LocationStreamer with multiple bounding boxes, e.g. as follows for Austin, San Francisco, and New York City.

> run-main bcomposes.twitter.LocationStreamer -97.8 30.25 -97.65 30.35 -122.75 36.8 -121.75 37.8 -74 40 -73 41


This shows the start of how you can use twitter4j with Scala for streaming. It also supports programmatic access to the actions that any Twitter user can take, including posting messages, retweeting, following, and more. I’ll cover that in a later tutorial. Also, some examples of using twitter4j will start showing up soon in the tshrdlu project.

Topics: Unix,spelling,tr,sort,uniq,find,awk


We can of course write programs to do most anything we want, but often the Unix command line has everything we need to perform a series of useful operations without writing a line of code. In my Applied NLP class today, I show how one can get a high-confidence dictionary out of a body of raw text with a series of Unix pipes, and I’m posting that here so students can refer back to it later and see some pointers to other useful Unix resources.

Note: for help with any of the commands, just type “man <command>” at the Unix prompt.

Checking for spelling errors

We are working on automated spelling correction as an in-class exercise, with a particular emphasis on the following sentence:

This Facebook app shows that she is there favorite acress in tonw

So, this has a contextual spelling error (there), an error that could be a valid English word but isn’t (acress) and an error that violates English sound patterns (tonw).

One of the key ingredients for spelling correction is a dictionary of words known to be valid in the language. Let’s assume we are working with English here. On most Unix systems, you can pick up an English dictionary in /usr/share/dict/words, though the words you find may vary from one platform to another. If you can’t find anything there, there are many word lists available online, e.g. check out the Wordlist project for downloads and links.

We can easily use the dictionary and Unix to check for words in the above sentence that don’t occur in the dictionary. First, save the sentence to a file.

$ echo "This Facebook app shows that she is there favorite acress in tonw" > sentence.txt

Next, we need to get the unique word types (rather than tokens) in sorted lexicographic order. The following Unix pipeline accomplishes this.

$ cat sentence.txt | tr ' ' '\n' | sort | uniq > words.txt

To break it down:

  • The cat command spills the file to standard output.
  • The tr command “translates” all spaces to new lines. So, this gives us one word per line.
  • The sort command sorts the lines lexicographically.
  • The uniq command makes those lines unique by making adjacent duplicates disappear. (This doesn’t do anything for this particular sentence, but I’m putting it in there in case you try other sentences that have multiple tokens of the type “the”, for example.)

You can see these effects by doing each in turn, building up the pipeline incrementally.

$ cat sentence.txt
This Facebook app shows that she is there favorite acress in tonw
$ cat sentence.txt | tr ' ' '\n'
$ cat sentence.txt | tr ' ' '\n' | sort
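
To see uniq actually collapse something, here is a small sketch on a made-up sentence (not the one from the exercise) in which some words repeat. It also shows why sort has to come first: uniq only removes duplicates that are adjacent. The file names are just illustrative, and LC_ALL=C is there only to make the sort order predictable from one machine to the next.

```shell
# A made-up sentence with repeated words ("the" and "cat" each appear twice).
echo "the cat saw the other cat" > repeat.txt

# Without sort, the duplicates are not adjacent, so uniq removes nothing.
cat repeat.txt | tr ' ' '\n' | uniq > unsorted.txt

# With sort, duplicates become adjacent and uniq collapses them.
cat repeat.txt | tr ' ' '\n' | LC_ALL=C sort | uniq > types.txt

wc -l < unsorted.txt   # 6: one line per token
wc -l < types.txt      # 4: one line per type
cat types.txt
```

The first count is the number of tokens and the second is the number of types, which is exactly the distinction we care about here.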

Note: the use of cat above is a UUOC (useless use of cat); sending the file directly into tr as input is generally preferred. I do it this way in the tutorial so that everything flows left-to-right. However, if you want to avoid cat abuse, here’s how you’d do it.

$ tr ' ' '\n' < sentence.txt | sort | uniq

We can now use the comm command to compare the file words.txt and the dictionary. It produces three columns of output: the first gives the lines only in the first file, the second are lines only in the second file, and the third are those in common. So, the first column has what we need, because those are words in our sentence that are not found in the dictionary. Here’s the command to get that.

$ comm -23 words.txt /usr/share/dict/words

The -23 options indicate we should suppress columns 2 and 3 and show only column 1. If we just use -2, we get the words in the sentence with the non-dictionary words on the left and the dictionary words on the right (try it).
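
Since /usr/share/dict/words differs from system to system, here is a self-contained sketch of the same comparison using a tiny made-up dictionary (mini_dict.txt and the other file names are just for illustration). One thing to keep in mind is that comm expects both inputs to be sorted in the same order it uses for comparing lines, so I set LC_ALL=C for the whole thing to keep sort and comm in agreement.

```shell
# Use one collation order for both sort and comm.
export LC_ALL=C

# A tiny stand-in dictionary (hypothetical; real word lists are far larger).
printf '%s\n' actress favorite in is she that there town | sort > mini_dict.txt

# Word types from the example sentence, one per line, sorted.
echo "This Facebook app shows that she is there favorite acress in tonw" \
  | tr ' ' '\n' | sort | uniq > sentence_types.txt

# Column 1 only: words in the sentence that are missing from the dictionary.
comm -23 sentence_types.txt mini_dict.txt > unknown.txt

# Column 3 only: words found in both files.
comm -12 sentence_types.txt mini_dict.txt > known.txt

cat unknown.txt
```

With this toy dictionary, unknown.txt ends up holding Facebook, This, acress, app, shows and tonw, while known.txt holds the six words that both files share.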

The problem of course is that any word list will have gaps. This dictionary doesn’t have more recent terms like Facebook and app. It also doesn’t have upper-case This. If your version of comm supports the -i option (the BSD/macOS one does; GNU comm does not), you can use it to ignore case and this goes away. It doesn’t have shows, which is not in the dictionary since it is an inflected form of the verb stem show. We could fix this with some morphological analysis, but instead of that, let’s go the lazy route and just grab a larger list of words.
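
On the case issue: not every comm has a flag for ignoring case, so a portable alternative is to lower-case both the sentence and the dictionary with tr before comparing them. Here is a minimal sketch with a made-up sentence and dictionary (all file names and words are just for illustration).

```shell
export LC_ALL=C

# Hypothetical mini dictionary, lower-cased and re-sorted after the change.
printf '%s\n' Austin favorite show this town \
  | tr 'A-Z' 'a-z' | sort | uniq > lc_dict.txt

# Lower-case the sentence before splitting it into sorted word types.
echo "This town shows This" \
  | tr 'A-Z' 'a-z' | tr ' ' '\n' | sort | uniq > lc_types.txt

# "this" and "town" now match; only the inflected form "shows" is left over.
comm -23 lc_types.txt lc_dict.txt > lc_unknown.txt
cat lc_unknown.txt
```

Note that the dictionary has to be sorted again after lower-casing, since changing case can change the order of the lines.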

Extracting a high-confidence dictionary from a corpus

Raw text often contains spelling errors, but errors don’t tend to happen with very high frequency, so we can often get pretty good expanded word lists by computing frequencies of word types on lots of text and then applying reasonable cutoffs. (There are much more refined methods, but this will suffice for current purposes.)

First, let’s get some data. The Open American National Corpus has just released v3.0.0 of its Manually Annotated Sub-Corpus (MASC), which you can get from this link.

Do the following to get it and set things up for further processing:

$ mkdir masc
$ cd masc
$ wget
$ tar xzf MASC-3.0.0.tgz

(If you don’t have wget, you can just download the MASC file in your browser and then move it over.)

Next, we want all the text from the data/written directory. The find command is very handy for this.

$ find data/written -name "*.txt" -exec cat {} \; > all-written.txt
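
If you want to try the find idiom without downloading MASC first, here is a sketch on a throwaway directory tree (demo_corpus and the file names in it are made up). It uses the `{} +` form of -exec, which hands many files to a single cat call rather than starting one cat per file as `{} \;` does; the result is the same for our purposes.

```shell
# Build a small throwaway directory tree (hypothetical file names).
mkdir -p demo_corpus/blog demo_corpus/email
echo "first blog post" > demo_corpus/blog/a.txt
echo "an email message" > demo_corpus/email/b.txt
echo "not text" > demo_corpus/email/b.xml

# Concatenate every .txt file under demo_corpus, however deeply nested.
# The .xml file is skipped because it does not match the -name pattern.
find demo_corpus -name "*.txt" -exec cat {} + > demo_all.txt

wc -l < demo_all.txt   # 2: one line from each .txt file
```

The order in which find visits files is not guaranteed, which doesn't matter here since we only count words later on.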

To see how much is there, use the wc command.

$ wc all-written.txt
   43061  400169 2557685 all-written.txt

So, there are 43k lines, and 400k tokens. That’s a bit small for what we are trying to do, but it will suffice for the example.

Again, I’ll build up a Unix pipeline to extract the high-confidence word types from this corpus. I’ll use the head command to show just part of the output at each stage.

Here are the raw contents.

$ cat all-written.txt | head

I can't believe I wrote all that last year.

Friday, 07 May 2010

Now, get one word per line.

$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | head


The tr translator is used very crudely: basically, anything that is not an ASCII letter character is turned into a new line. The -cs options indicate to take the complement (opposite) of the ‘A-Za-z’ argument and to squeeze duplicates (e.g. A42, becomes A with a single new line rather than three).
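
To make the effect of the two options concrete, here is a small sketch that runs tr on a short made-up string with and without -s (the output file names are just for illustration).

```shell
# Everything that is not an ASCII letter becomes a newline (-c), and runs of
# such newlines are squeezed into one (-s).
echo "A42, becomes A" | tr -cs 'A-Za-z' '\n' > squeezed.txt

# Same translation without -s: every non-letter character, including the
# digits, the comma and the spaces, turns into its own newline.
echo "A42, becomes A" | tr -c 'A-Za-z' '\n' > unsqueezed.txt

cat squeezed.txt
wc -l < squeezed.txt     # 3
wc -l < unsqueezed.txt   # 6
```

Without squeezing you get empty lines in the output, which would later show up as a bogus "empty" word type in the counts.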

Next, we sort and uniq, as above, except that we use the -c option to uniq so that it produces counts.

$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | head
 737 A
  22 AA
   1 AAA
   1 AAF
   1 AAPs
  21 AB
   3 ABC
   1 ABLE

Because the MASC corpus includes tweets and blogs and other unedited text, we don’t trust words that have low counts, e.g. four or fewer tokens of that type. We can use awk to filter those out.

$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | awk '{ if($1>4) print $2 }' | head

Awk makes it easy to process lines of files, and gives you indexes into the first column ($1), second ($2), and so on. There’s much more you can do, but this shows how you can conditionally output some information from each line using awk.
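
Here is the same count-and-filter idea as a sketch on a made-up sentence, small enough that you can check the counts by eye. It uses awk's pattern { action } form, which is equivalent to the if statement above, and passes the threshold in with -v so it is easy to change (min is just a variable name I picked).

```shell
export LC_ALL=C

# Made-up text: "the" occurs 3 times, "cat" twice, everything else once.
echo "the cat and the dog saw the other cat" \
  | tr -cs 'A-Za-z' '\n' | sort | uniq -c > demo_counts.txt

# Keep only the types seen at least min times: $1 is the count, $2 the word.
awk -v min=2 '$1 >= min { print $2 }' demo_counts.txt > demo_frequent.txt

cat demo_frequent.txt
```

Only cat and the survive with min=2; raising it to 3 would leave just the.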

You can of course change the threshold. You can also turn all words to lower-case by inserting another tr call into the pipe, e.g.:

$ cat all-written.txt | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' | sort | uniq -c | awk '{ if($1>8) print $2 }' | head

It all comes down to what you need out of the text.

Combining and using the dictionaries

Let’s do the check on the sentence above, but using both the standard dictionary and the one derived from MASC. Run the following command first.

$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | awk '{ if($1>4) print $2 }' > /tmp/masc_vocab.txt

Then in the directory where you saved words.txt, do the following.

$ cat /usr/share/dict/words /tmp/masc_vocab.txt | sort | uniq > big_vocab.txt
$ comm -23 words.txt big_vocab.txt

Ta-da! The MASC corpus provided us with enough examples of other words that This, Facebook, app, and shows are no longer detected as errors. Of course, detecting there as an error is much more difficult and requires language models and more.
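
The merge step can also be written with sort -u, which sorts and removes duplicate lines in one go. Here is a self-contained sketch with two tiny made-up word lists standing in for the system dictionary and the MASC vocabulary (all file names and contents are hypothetical).

```shell
export LC_ALL=C

# Two hypothetical word lists with one overlapping entry ("show").
printf '%s\n' favorite show town > dict_a.txt
printf '%s\n' Facebook app show shows > dict_b.txt

# sort -u sorts and drops duplicate lines in one step (same as sort | uniq).
sort -u dict_a.txt dict_b.txt > merged_vocab.txt

# Check a few already-sorted words against the merged list; "-" makes comm
# read its first input from standard input.
printf '%s\n' Facebook acress shows tonw \
  | comm -23 - merged_vocab.txt > still_unknown.txt

cat still_unknown.txt
```

As in the real run, the words that only the second list knows about stop being flagged, and the genuine misspellings acress and tonw remain.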


Learn to use the Unix command line! This post is just a start into many cool things you can do with Unix pipes. Here are some other resources:

Happy (Unix) hacking!

Topics: Twitter, streaming API


Analyzing tweets is all the rage, and if you are new to the game you want to know how to get them programmatically. There are many ways to do this, but a great start is to use the Twitter streaming API, a RESTful service that allows you to pull tweets in real time based on criteria you specify. For most people, this will mean having access to the spritzer, which provides only a very small percentage of all the tweets going through Twitter at any given moment. For access to more, you need to have a special relationship with Twitter or pay Twitter or an affiliate like Gnip.

This post provides a basic walk-through for using the Twitter streaming API. You can get all of this based on the documentation provided by Twitter, but this will be slightly easier going for those new to such services. (This post is mainly geared for the first phase of the course project for students in my Applied Natural Language Processing class this semester.)

You need to have a Twitter account to do this walk-through, so obtain one now if you don’t have one already.

Accessing a random sample of tweets

First, try pulling a random sample of tweets using your browser by going to the following link.

You should see a growing, unwieldy list of raw tweets flowing by. It should look something like the following image.


Here’s an example of a “raw” tweet (which comes in JSON, or JavaScript Object Notation):

{"text":"#LetsGoMavs til the end RT @dallasmavs: Are You ALL IN?","truncated":false,"retweeted":false,"geo":null,"retweet_count":0,"source":"web","in_reply_to_status_id_str":null,"created_at":"Wed Apr 25 15:47:39 +0000 2012","in_reply_to_user_id_str":null,"id_str":"195177260792299521","coordinates":null,"in_reply_to_user_id":null,"favorited":false,"entities":{"hashtags":[{"text":"LetsGoMavs","indices":[0,11]}],"urls":[],"user_mentions":[{"indices":[27,38],"screen_name":"dallasmavs","id_str":"22185437","name":"Dallas Mavericks","id":22185437}]},"contributors":null,"user":{"show_all_inline_media":true,"statuses_count":3101,"following":null,"profile_background_image_url_https":"https:\/\/\/profile_background_images\/285480449\/AAC_med500.jpg","profile_sidebar_border_color":"eeeeee","screen_name":"flyingcape","follow_request_sent":null,"verified":false,"listed_count":2,"profile_use_background_image":true,"time_zone":"Mountain Time (US &amp; Canada)","description":"HUGE ROCKETS &amp; MAVS fan. Lets take down the Lakers &amp; beat up on the East. Inaugural member of the FC Dallas – Fort Worth fan club.","profile_text_color":"333333","default_profile":false,"profile_background_image_url":"http:\/\/\/profile_background_images\/285480449\/AAC_med500.jpg","created_at":"Thu Oct 21 15:40:21 +0000 2010","is_translator":false,"profile_link_color":"1212cc","followers_count":35,"url":null,"profile_image_url_https":"https:\/\/\/profile_images\/1658982184\/204970_10100514487859080_7909803_68807593_5366704_o_normal.jpg","profile_image_url":"http:\/\/\/profile_images\/1658982184\/204970_10100514487859080_7909803_68807593_5366704_o_normal.jpg","id_str":"205774740","protected":false,"contributors_enabled":false,"geo_enabled":true,"notifications":null,"profile_background_color":"0a2afa","name":"Mandy","default_profile_image":false,"lang":"en","profile_background_tile":true,"friends_count":48,"location":"ATX \/ FDub. 
From Galveston !","id":205774740,"utc_offset":-25200,"favourites_count":231,"profile_sidebar_fill_color":"efefef"},"id":195177260792299521,"place":{"bounding_box":{"type":"Polygon","coordinates":[[[-97.938383,30.098659],[-97.56842,30.098659],[-97.56842,30.49685],[-97.938383,30.49685]]]},"country":"United States","url":"http:\/\/\/1\/geo\/id\/c3f37afa9efcf94b.json","attributes":{},"full_name":"Austin, TX","country_code":"US","name":"Austin","place_type":"city","id":"c3f37afa9efcf94b"},"in_reply_to_screen_name":null,"in_reply_to_status_id":null}

There is a lot of information in there beyond the tweet text itself, which is simply “#LetsGoMavs til the end RT @dallasmavs: Are You ALL IN?” It is basically a map from attributes to values (and values may themselves be such a map, e.g. for the “user” attribute above). You can see how many times the tweet has been retweeted (which will be zero when the tweet is first published), what time it was created, the unique tweet id, the geo-coordinates (if available), and more. If an attribute does not have a value for the tweet, it is ‘null’.

I will return to JSON processing of tweets in a later tutorial, but you can get a head start by seeing my tutorial on using Scala to process JSON in general.

Command line access to tweets

Assuming you were successful in being able to view tweets in the browser, we can now proceed to using the command line. For this, it will be convenient to first set environment variables for your Twitter username and password.

$ export TWUSER=foo
$ export TWPWD=bar

Obviously, you need to provide your Twitter account details instead of foo and bar…

Next, we’ll use the program curl to interact with the API. Try it out by downloading this blog post.

$ curl > bcomposes-twitter-api.html
$ less bcomposes-twitter-api.html

Given that you pulled tweets from the API using your web browser, and that curl can access web pages in this way, it is simple to use curl to get tweets and direct them straight to a file.

$ curl -u$TWUSER:$TWPWD > tweets.json

That’s it: you now have an ever-growing file with randomly sampled tweets. Have a look and try not to lose your faith in humanity. ;)

Pulling tweets with specific properties

You might want to get the tweets from specific users rather than a random sample. This requires user ids rather than the user names we usually see. The id for a user can be obtained from the Twitter API by looking at the /users/show endpoint. For example, the following gives my information:

Which gives:

<name>Jason Baldridge</name>
<location>Austin, Texas</location>
Assoc. Prof., Computational Linguistics, UT Austin. Senior Data Scientist, Converseon. OpenNLP developer. Scala, Java, R, and Python programmer.

So, to follow @jasonbaldridge via the Twitter API, you need user id 119837224. You can pull my tweets via the API using the “follow” query parameter.

$ curl -d follow=119837224 -u$TWUSER:$TWPWD

There is a good chance I’m not tweeting right now, so you’ll probably not see anything. Let’s follow more users, which we can do by adding more ids separated by commas.

$ curl -d follow=1344951,5988062,807095,3108351 -u$TWUSER:$TWPWD

This will follow Wired Magazine (@wired), The Economist (@theeconomist), the New York Times (@nytimes), and the Wall Street Journal (@wsj).

You can also write those ids to a file and read them from the file. For example:

$ echo "follow=1344951,5988062,807095,3108351" > following
$ curl -d @following -u$TWUSER:$TWPWD

You can of course edit the file “following” directly rather than using echo to create it. The file can be named whatever you like; there is nothing special about the name “following”.

You can search for a particular term in tweets, such as “Scala”, using the “track” query parameter.

$ curl -d track=scala -u$TWUSER:$TWPWD

And, no surprise, you can search for multiple items by using commas to separate them.

$ curl -d track=scala,python,java -u$TWUSER:$TWPWD

However, this only requires that a tweet match at least one of these terms. If you want to require that several terms all match, separate them with spaces rather than commas: a space means AND and a comma means OR. Because of the spaces, it is easiest to write the query to a file and then refer to that file. For example, to get tweets that have both “sentiment” and “analysis” OR both “machine” and “learning” OR both “text” and “analytics”, you could do the following:

$ echo "track=sentiment analysis,machine learning,text analytics" > tracking
$ curl -d @tracking -u$TWUSER:$TWPWD
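The matching rule for “track” can be sketched in a few lines of Python. This is only an approximation for illustration: it assumes case-insensitive matching on whole words, which simplifies what Twitter actually does, and the function name matches_track is made up here.

```python
import re

def matches_track(track, text):
    """Return True if text satisfies a track expression (simplified)."""
    # Lowercase and split into word tokens; real matching has more cases.
    tokens = set(re.findall(r"\w+", text.lower()))
    for phrase in track.lower().split(","):
        terms = phrase.split()
        # Every term in a phrase must be present (AND) ...
        if terms and all(term in tokens for term in terms):
            # ... but one matching phrase is enough (OR).
            return True
    return False

track = "sentiment analysis,machine learning,text analytics"

# Both "machine" and "learning" occur, in any order, so this matches.
print(matches_track(track, "Learning about machine translation"))

# "sentiment" and "text" each cover only half of a phrase, so this does not.
print(matches_track(track, "New sentiment tools for text"))
```

Note that the terms of a phrase do not need to be adjacent or in order; they only need to all appear in the tweet.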

You can pull tweets from a specific rectangular area (bounding box) on the Earth’s surface. For example, the following pulls geotagged tweets from Austin, Texas.

$ curl -d locations=-97.8,30.25,-97.65,30.35 -u$TWUSER:$TWPWD

The bounding box is given as longitude (bottom left), latitude (bottom left), longitude (top right), latitude (top right); that is, the southwest corner followed by the northeast corner, with longitude first in each pair. You can add further bounding boxes to capture more locations. For example, the following captures tweets from Austin, San Francisco, and New York City.

$ curl -d locations=-97.8,30.25,-97.65,30.35,-122.75,36.8,-121.75,37.8,-74,40,-73,41 -u$TWUSER:$TWPWD
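It is easy to mix up the order of the values, since the boxes in the commands above start with a longitude (for Austin, a value near -97) rather than a latitude. Here is a small Python sketch for building the “locations” value and checking whether a point falls in a box. The helper names locations_param and in_box are hypothetical, and the check assumes boxes that do not cross the 180th meridian.

```python
def locations_param(boxes):
    """Join (west, south, east, north) boxes into one locations value."""
    return ",".join(str(value) for box in boxes for value in box)

def in_box(lon, lat, box):
    """Return True if the point (lon, lat) lies inside the box."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north

# Same boxes as in the curl command above.
austin = (-97.8, 30.25, -97.65, 30.35)
san_francisco = (-122.75, 36.8, -121.75, 37.8)
new_york = (-74, 40, -73, 41)

param = locations_param([austin, san_francisco, new_york])
print("locations=" + param)

# A point in central Austin, given as longitude then latitude.
print(in_box(-97.74, 30.27, austin))
```

Swapping the two coordinates of the test point makes in_box return False, which is a quick way to catch a box written in the wrong order.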


It’s all pretty straightforward, and quite handy for many kinds of tweet-gathering needs. One problem is that Twitter will drop the connection at times, and you’ll miss tweets until you start a new process. If you need constant monitoring, see UT Austin’s Twools (Twitter tools), which picks up again whenever Twitter drops your connection so that you get a steady stream of tweets.
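The general idea behind picking up after a dropped connection is a loop that reopens the stream and waits a little longer after each failure. The Python sketch below illustrates that idea only; it is not how Twools is implemented. The open_stream argument is a hypothetical callable that returns an iterable of tweets and either ends or raises ConnectionError when the connection drops, and the delay values are arbitrary.

```python
import time

def stream_forever(open_stream, handle, max_restarts=None,
                   base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call open_stream() repeatedly, passing each item to handle().

    open_stream is a hypothetical callable returning an iterable of tweets.
    Returns the number of restarts once max_restarts is reached; with
    max_restarts=None it runs until the process is stopped.
    """
    delay = base_delay
    restarts = 0
    while True:
        try:
            for item in open_stream():
                handle(item)
                delay = base_delay  # data arrived, so reset the backoff
        except ConnectionError:
            pass  # treat an error the same as a stream that simply ended
        if max_restarts is not None and restarts >= max_restarts:
            return restarts
        restarts += 1
        sleep(delay)
        delay = min(delay * 2, max_delay)  # wait longer after each failure

# Demo with a fake stream that drops after two tweets every time.
attempts = []

def fake_stream():
    attempts.append(1)
    n = len(attempts)
    yield f"tweet {2 * n - 1}"
    yield f"tweet {2 * n}"
    raise ConnectionError("dropped")

collected = []
waits = []
stream_forever(fake_stream, collected.append, max_restarts=2, sleep=waits.append)
print(collected)
```

Passing sleep as a parameter keeps the demo instant; in real use you would leave it at the default and write each tweet to a file in handle.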

In a later post, I’ll detail how to use a library like twitter4j to pull tweets and interact with Twitter at a more fundamental level.

Several years ago, I implemented a Gibbs sampler in R for the artificial data of Steyvers and Griffiths (2007), “Probabilistic topic models”, which I used for a class demo and have been meaning to post as a GitHub gist. Here it is:

The artificial problem provides a very nice, simple test case for seeing the inference of the topic-word and document-topic distributions using Gibbs sampling. The code for the sampling is shorter than the setup code. There are comments in the code that should make everything self-explanatory if you read Steyvers and Griffiths.

To run it, you can of course just paste it into an R session. You can also run it from the command line, e.g.:

$ R --no-save < topics_gibbs_sg_example.R
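For readers who do not use R, the same kind of sampler can be sketched in Python. To be clear, this is not a port of the R code: it is a generic collapsed Gibbs sampler for a topic model, the tiny document collection is made up in the spirit of the money/bank/river example, and the hyperparameters alpha and beta, the iteration count, and the seed are arbitrary assumptions.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, iterations=200,
              alpha=1.0, beta=1.0, seed=0):
    """Collapsed Gibbs sampling over docs given as lists of word ids."""
    rng = random.Random(seed)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    doc_topic = [[0] * num_topics for _ in docs]                # topic counts per doc
    topic_total = [0] * num_topics                              # tokens per topic
    assign = []

    # Give every token a random initial topic and record the counts.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            topic_word[k][w] += 1
            doc_topic[d][k] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                topic_word[k][w] -= 1
                doc_topic[d][k] -= 1
                topic_total[k] -= 1
                # Unnormalized probability of each topic given all other tokens.
                weights = [
                    (topic_word[j][w] + beta) / (topic_total[j] + vocab_size * beta)
                    * (doc_topic[d][j] + alpha)
                    for j in range(num_topics)
                ]
                # Sample a new topic and add the token back in.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                topic_word[k][w] += 1
                doc_topic[d][k] += 1
                topic_total[k] += 1
    return topic_word, doc_topic

# Made-up toy data: word ids index into vocab.
vocab = ["money", "loan", "bank", "river", "stream"]
finance = [0, 1, 2] * 4
nature = [2, 3, 4] * 4
mixed = [0, 1, 2, 2, 3, 4] * 2
docs = [finance, finance, finance, nature, nature, nature, mixed, mixed]

topic_word, doc_topic = gibbs_lda(docs, num_topics=2, vocab_size=len(vocab))
for k, counts in enumerate(topic_word):
    print("topic", k, dict(zip(vocab, counts)))
```

The two count tables that come back are the raw material for the topic-word and document-topic distributions: adding the smoothing constants and normalizing each row gives the estimates.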

If you are interested in other tutorials that discuss Bayesian learning and samplers (with a definite slant toward natural language processing), check these out:

