Shaun King has been one of the most visible and vocal leaders of the Black Lives Matter movement over the past year. He’s done a great deal to raise awareness of police misconduct and brutality, with a particular emphasis on the disproportionate targeting of black Americans. (It is worth noting that he and others have also called attention to police killings of people of other races, even while the supposed #AllLivesMatter folks seemed oddly silent.)

An unsurprising development regarding Black Lives Matter is that its leaders are coming under character attacks. There is a long tradition of privileged segments of society and even the government doing this, including to the now revered and respected Martin Luther King. Shaun King has recently come under a very odd sort of attack from the conservative media: they are saying he is not truly black and accusing him of being duplicitous, like Rachel Dolezal.

This issue has a particular resonance for me because my family is a tangible example of the complexities of the concept of race. I’m white and my wife is black. We have two sons, both our biological children. The picture on the right is of our four hands. Our older son has darker skin and dark curly hair. He’s absolutely beautiful. Most people see him and think of him as “black”. Our younger son has light skin and blonde hair with just a hint of curl. He’s absolutely beautiful. Most people see him and think of him as “white”. In fact, when my wife is in public with our younger son (and without me), most people assume she’s his nanny. (And white people monitor her to make sure she’s treating him well, but that’s another story.)

So, we have these two children who are perceived very differently by others. Are you to tell me that the younger one isn’t “black” or is less “black” than his older brother? Just like Shaun King isn’t black because his skin is too light? What if both of my sons strongly identify with their black heritage and become leaders of some future “black” movement that seeks to reduce racial disparities? Would my younger son be attacked for not being “black” enough? With his older brother standing right by his side and no one questioning his blackness? One gets to speak for the black community because the genetic dice gave him the darker skin and hair, while the other is unsuitable? That would be pure and utter bullshit.

Let’s step back for a moment. It’s important to consider what “race” means and how any given individual might define it differently from others. And that one’s own notion of racial categories might shift over time, as applied to others or even to oneself. Can we even operationalize racial categories? It’s rather tricky. I wrote about this in the context of machine learning, and there’s good recent academic work on figuring out what the notion of race fundamentally encompasses. As Sen and Wasow argue in their article “Race as a bundle of sticks”, we should look at race as a multi-faceted group of properties, some of which are immutable (like genes) and many of which are mutable (such as location, religion, diet, etc.). The very notion of racial categorization shifts over time—for example, there was a time not long ago when southern Europeans were not considered “white”. All this is not to say that race isn’t a thing, but that it is very, very complicated. In fact, it is far more complicated than most people have ever stopped to really consider.

Returning to the attacks on Shaun King, here’s the thing: I personally don’t care if he is “black” or not, or is somewhat “black” or not. He could be Asian or white and it wouldn’t matter. I think he is doing what he’s doing because he is a caring human being who believes it is right and necessary. He wants to raise awareness of and reduce police violence and reduce racial disparities. That’s a laudable goal no matter who you are, no matter what race you identify with, no matter what. Period.

To me, this is clearly an ad hominem attack based on the flawed premise that race is a concept that we can clearly and objectively delineate. It has nothing to do with the facts and arguments that surround questions of racism in the USA, police conduct and related issues. There is plenty to debate there and, for what it’s worth, I don’t agree with Shaun King on many things. We all must do our best to learn, consider and reflect on the information we have. Ideally, we also seek new perspectives and keep an open mind while doing so. As it is, this attack is a distraction designed to deflect attention away from the real issues. It’s just smoke and mirrors.

And if you think there aren’t real issues here… Ask yourself if you think our country should support a truly Kafka-esque institution like Rikers Island. Ask yourself if you are comfortable with the Sandra Bland traffic stop (even Fox News and Donald Trump aren’t, as Larry Wilmore noted). Ask yourself if people should be threatened by the police when they are in their own driveway, hooking up their own boat to their own car. Ask whether the police should be outfitted with military-grade vehicles and weapons (see also John Oliver’s serious/humorous take on this). These are just a few (important) examples, and there are unfortunately many more. They do not reflect the United States of America that I believe in—a great country based on a civil society that protects the rights of individuals without prejudice for their race, religion, political affiliation, etc. You are ignoring much evidence if you think there isn’t a problem. Pay attention, please.

This is a horrible video of police in McKinney, Texas treating a bunch of kids — I stress, KIDS — at a pool party in a very heavy-handed way, way out of proportion to the situation (the “incident”). One officer, Eric Casebolt, pulls his gun as a threat (and he is now on leave because of it). Kids who had nothing to do with the situation are handcuffed, yelled at, and called motherfuckers. I can’t imagine this happening at a similar party in my (almost entirely white) hometown of Rockford, Michigan.

For more context, see this article.

I find this all very upsetting, and I took up Joshua Dubois’ suggestion to write to the police chief. My letter is below.

Dear Police Chief Conley,

I’m writing to express my extreme disapproval and concern regarding the incident in McKinney involving very heavy-handed behavior by police, and in particular Corporal Eric Casebolt, against a group of teens.

I have reviewed the videos and read many different reports on the matter, and I realize that there may be more information yet to come to light. Regardless of how things transpired prior to the police force arriving, the actions of Corporal Casebolt are incredibly disturbing: yanking a 14-year-old girl by her hair, pinning her to the ground, chasing other teens with a gun, and swearing and cursing at teens. Many of the teens were interacting very respectfully, yet he tells them to “sit your asses down on the ground”. Many of the other teens appear incredibly scared — wanting to help their friends, but not wanting to escalate the situation (which is probably wise given recent events in the country and Corporal Casebolt’s disposition and his brandishing of his gun).

This is not behavior befitting an officer of the law. I fully realize that the police have an important and difficult job to do, and I’m thankful to those who serve and keep the peace. I believe a big part of that job is to show respect to the people that the police serve, and to apply rules and force consistently, regardless of the age, race, or socio-economic status of the individuals involved. Sadly, recent events in the country, including Saturday’s incident in McKinney, indicate that this is far from the case currently.

I’m not writing this just as a concerned citizen from afar. I live in Austin, Texas. My wife is African-American and we have two biracial sons, currently two and six years old. My six year old likes dinosaurs, tennis, and math. He’s going to do amazing things, but I fear that society—including the authorities—will view him as a threat by the time he becomes a teenager in 2022. My wife has family who live in Lewisville, less than 30 minutes from McKinney. If my son goes to a pool party with his cousin in seven years, should I worry that he will be handcuffed just for being present? And that no matter how polite and respectful he is, he’ll be told to sit his ass down? I certainly hope not, but seven years isn’t very much time. I sincerely hope that you and others in similar positions will do whatever you can to help reduce the likelihood of these sorts of incidents and to ensure that the members of the police force are respectful of the rights of all citizens. A good start to this would be for you to dismiss Corporal Casebolt.


Dr. Jason Baldridge

Associate Professor of Computational Linguistics, The University of Texas at Austin

Co-founder and Chief Scientist, People Pattern

I’m not at all sure it will do any good, but it’s a start to trying to effect some change. If you feel the same, please consider writing and getting involved. Follow Shaun King and DeRay Mckesson for much, much more on what is going on with the police and racism. We need to find a better way forward, as a society.

Bozhidar Bozhanov wrote a blog post titled “The Low Quality of Academic Code“, in which he observed that most academic software is poorly written. He makes plenty of fair points, e.g.:

… there are too many freshman mistakes – not considering thread-safety, cryptic, ugly and/or stringly-typed APIs, lack of type-safety, poorly named variables and methods, choosing bad/slow serialization formats, writing debug messages to System.err (or out), lack of documentation, lack of tests.

But, here’s the thing — I would argue that this lack of engineering quality in academic software is a feature, not a bug. For academics, there is basically little to no incentive to produce high quality software, and that is how it should be. Our currency is ideas and publications based on them, and those are obtained not by creating wonderful software, but by having great results. We have limited time, and that time is best put into thinking about interesting models and careful evaluation and analysis. The code is there to support that, and is fine as long as it is correct.

The truly important metric for me is whether the code supports replication of the results in the paper it accompanies. The code can be as ugly as you can possibly imagine as long as it does this. Unfortunately, a lot of academic software doesn’t make replication easy. Nonetheless, having the code open sourced makes it at least possible to hack with it to try to replicate previous results. In the last few years, I’ve personally put a lot of effort into having my work and my students’ work easy to replicate. I’m particularly proud of how I put code, data and documentation together for a paper I did on topic model evaluation with James Scott for AISTATS in 2013, “A recursive estimate for the predictive likelihood in a topic model.” That was a lot of work, but I’ve already benefited from it myself (in terms of being able to get the data and run my own code). Check out the “code” links in some of my other papers for some other examples that my students have done for their research.

Having said the above, I think it is really interesting to see how people who have made their code easy to use (though not always well-engineered) have benefited from doing so in the academic realm. A good example is word2vec: the software that was released for it generated tons of interest in industry as well as academia, and it probably led to much wider dissemination of that work and to more follow-on work. Academia itself doesn’t reward that directly, nor should it. That’s one reason you see it coming out of companies like Google, but it might be worth it to some researchers in some cases, especially PhD students who seek industry jobs after they defend their dissertations.

I read a blog post last year in which the author encouraged people to open source their code and not worry about how crappy it was. (I wish I could remember the link, so if you have it, please add it in a comment. Update: here is the post, “It’s okay for your open source library to be a bit shitty.“) I think this is a really important point. We should be careful not to get overly critical about code that people have made available to the world for free—not because we don’t want to damage their fragile egos, but because we want to make sure that people generally feel comfortable open sourcing. This is especially important for academic code, which is often the best recipe, no matter how flawed it might be, that future researchers can use to replicate results and produce new work that meaningfully builds on or compares to that work.

Update: Adam Lopez pointed out this very nice related article by John Regehr “Producing good software from academia“.

Addendum: When I was a graduate student at the University of Edinburgh, I wrote a software package called OpenNLP Maxent (now part of the OpenNLP toolkit, which I also started then and which is still used widely today). While I was still a student, a couple of companies paid me to improve aspects of the code and documentation, which really helped me make ends meet at the time and made the code much better. I highly encourage this model — if there is an academic open source package that you think your company could benefit from, consider hiring the grad student author to make it better for the things that matter for your needs! (Or do it yourself and do a pull request, which is much easier today with Github than it was in 2000 with Sourceforge.)

Update: Thanks to the commenters below for providing the link to the post I couldn’t remember: “It’s okay for your open source library to be a bit shitty.” As a further note, the author surprisingly connects this topic to feminism in a cool way.

I’m a longtime fan of Chris Manning and Hinrich Schütze’s “Foundations of Statistical Natural Language Processing” — I’ve learned from it, I’ve taught from it, and I still find myself thumbing through it from time to time. Last week, I wrote a blog post on SXSW titles that involved looking at n-grams of different lengths, including unigrams, bigrams, trigrams and … well, what do we call the next one up? Manning and Schütze devoted an entire paragraph to it on page 193, which I absolutely love and thought would be fun to share for those who haven’t seen it.

Before continuing with model-building, let us pause for a brief interlude on naming. The cases of n-gram language models that people usually use are for n=2,3,4, and these alternatives are usually referred to as a bigram, a trigram, and a four-gram model, respectively. Revealing this will surely be enough to cause any Classicists who are reading this book to stop, and leave the field to uneducated engineering sorts: “gram” is a Greek root and so should be put together with Greek number prefixes. Shannon actually did use the term “digram”, but with the declining levels of education in recent decades, this usage has not survived. As non-prescriptive linguists, however, we think that the curious mix of English, Greek, and Latin that our colleagues actually use is quite fun. So we will not try to stamp it out. (1)

And footnote (1) follows this up with a note on four-grams.

1. Rather than “four-gram”, some people do make an attempt at appearing educated by saying “quadgram”, but this is not really correct use of a Latin number prefix (which would be “quadrigram”, cf. “quadrilateral”), let alone correct use of a Greek number prefix, which would give us “a tetragram model.”

In part to be cheeky, I went with “quadrigram” in my post, which was obviously a good choice as it has led to the term being the favorite word of the week for Ken Cho, my People Pattern cofounder, and the office in general. (“Hey Jason, got any good quadrigrams in our models?”)

If you want to try out some n-gram analysis, check out my followup blog post on using Unix, Mallet, and BerkeleyLM for analyzing SXSW titles. You can call 4+-grams whatever you like.
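And if you just want to see the mechanics without any toolkits, here is a minimal sketch (my own illustration, not from those posts) of extracting n-grams of arbitrary order with Scala’s sliding method; the example title is made up:

object NgramSketch {

  // Return the n-grams of the given order as space-joined strings.
  def ngrams(tokens: Seq[String], n: Int): Seq[String] =

  def main(args: Array[String]) {
    val tokens = "how to build word clouds with scala".split("\\s+").toSeq
    println(ngrams(tokens, 4)) // the quadrigrams (or tetragrams, if you insist)
  }

}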

Note: This is a repost of a blog post about the Facebook emotional contagion experiment that I wrote on People Pattern’s blog.


This is the first in a series of posts responding to the controversial Facebook study on emotional contagion.

The past two weeks have seen a great deal of discussion around the recent computational social science study of Kramer, Guillory and Hancock (2014), “Experimental evidence of massive-scale emotional contagion through social networks”. I encourage you to read the published paper before getting caught up in the maelstrom of commentary. The wider issues are critical to address, and I have summarized the often conflicting but thoughtful perspectives below. These issues strike close to home, given our company’s expertise in computational linguistics and reliance on social media.

In this post, I provide a brief description of the original paper itself along with a synopsis of the many perspectives that have been put forth in the past two weeks. This post sets the stage for two posts to follow tomorrow and Tuesday next week that provide our take on the study plus our own Facebook-external opt-in version of the experiment, which anyone currently using Facebook can participate in.

Summary of the study

Kramer, Guillory and Hancock’s paper provides evidence that emotional states as expressed in social media posts are contagious, in that they affect whether readers of those posts reflect similar positive or negative emotional states in their own later posts. The evidence is based on an experiment involving about 700,000 Facebook users over a one week period in January 2012. These users were split into four groups: a group that had a reduction in positive messages in their Facebook feed, another that had a reduction in negative messages, a control group that had an overall 5% reduction in posts, and a second control group that had a 2% reduction. Positivity and negativity were determined using the LIWC word lists. LIWC, which was created and is maintained by my University of Texas at Austin colleague James Pennebaker, is a standard resource for psychological studies of emotional expression in language. Over the past two decades, it has been applied to language from varying sources, including speech, essays, and social media.

The study found a small but statistically significant difference in emotional expression between the positive suppression group and the control and the negative suppression group and the control. Basically, users who had positive posts suppressed produced slightly lower rates of positive word usage and slightly higher rates of negative word usage, and the mirror image of this was found for the negative suppression group (check out the plot for these). (This description of the study is short — see Nitin Madnani’s description for more detail and analysis.)
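To make the word-counting methodology concrete, here is a toy sketch of the general approach in Scala; the mini word lists are illustrative stand-ins, not actual LIWC entries:

object EmotionRates {

  // Placeholder lexicons standing in for the real LIWC word lists.
  val positive = Set("happy", "love", "great", "wonderful")
  val negative = Set("sad", "hate", "awful", "terrible")

  // Fraction of a post's tokens that appear in each word list.
  def rates(post: String): (Double, Double) = {
    val tokens = post.toLowerCase.split("\\W+").filter(_.nonEmpty)
    val total = tokens.length.toDouble
    (tokens.count(positive) / total, tokens.count(negative) / total)
  }

  def main(args: Array[String]) {
    println(rates("What a wonderful, happy day!")) // prints (0.4,0.0)
  }

}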

The study was published in PNAS, and then the shit hit the fan.

Objections to the study

Objections to the study and the infrastructure that made it possible have come from many sources. The two major complaints have to do with ethical considerations and research flaws.

The first major criticism is that the study was unethical. The key problem is that there was no informed consent: Facebook users had no idea that they were part of this study and had no opportunity to opt out of it. An important aspect of this is that the study conforms to the Facebook terms of service: Facebook has the right to experiment with feed filtering algorithms as part of improving its service. However, because Jeff Hancock is a Cornell University professor, many argue the study should have gone through Cornell’s IRB process. Furthermore, many feel that Facebook should obtain consent from users when running such experiments, whether for eventual publication or for in-company studies to improve the service. The editors of PNAS have issued an editorial expression of concern over the lack of informed consent and opt-out for subjects of the study. We agree this is an issue, so in our third post, we’ll introduce a way this can be achieved through an opt-in version of the study.

The second type of criticism is that the research is flawed or otherwise unconvincing. The most obvious issue is that the effect sizes are small. A subtler problem familiar to anyone who has done anything with sentiment analysis is that counting positive and negative words is a highly imperfect means for judging the positivity/negativity of a text (e.g. it does the wrong thing with negations and sarcasm — see Pang and Lee’s overview). Furthermore, the finding that reducing positive words seen leads to fewer positive words produced does not mean that the user’s actual mood was affected. We will return to this last point in tomorrow’s post.

Support for the study

In response, several authors have joined the discussion to support the study and others similar to it, or to refute some aspects of the criticism leveled at it.

Several commentators have made unequivocal statements that the study would never have obtained IRB approval. This is in fact a misperception: Michelle Meyer provides a great overview of many aspects of IRB approval and concludes that this particular study could have legitimately passed the IRB process. A key point for her is that if an IRB had approved the study, that would probably have been the right decision. She concludes: “We can certainly have a conversation about the appropriateness of Facebook-like manipulations, data mining, and other 21st-century practices. But so long as we allow private entities freely to engage in these practices, we ought not unduly restrain academics trying to determine their effects.”

Another defense is that many concerns expressed about the study are misplaced. Tal Yarkoni argues in “In Defense of Facebook” that many critics have inappropriately framed the experimental procedure as injecting positive or negative content into feeds, when in fact it was removal of content. Secondly, he notes that Facebook already manipulates users’ feeds, and this study is essentially business-as-usual in this respect. Yarkoni also argues that it is a good thing that Facebook publishes such research: “by far the most likely outcome of the backlash Facebook is currently experiencing is that, in future, its leadership will be less likely to allow its data scientists to publish their findings in the scientific literature.” They will do the work regardless, but the public will have less visibility into the kinds of questions Facebook can ask and the capabilities they can build based on the answers they find.

Duncan Watts takes this to another level, saying that companies like Facebook actually have a moral obligation to conduct such research. He writes in the Guardian that the existence of social networks like Facebook gives us an amazing new platform for social science research, akin to the advent of the microscope. He argues that companies like Facebook, as the gatekeepers of such networks, must perform and disseminate research into questions such as how users are affected by the content they see.

Finally, such collaborations between industry and academia should be encouraged. Kate Niederhoffer and James Pennebaker argue that both industry and academia are best served through such collaborations and that the discussion around this study provides an excellent case study. In particular, the backlash against the study highlights the need for more rigor, awareness and openness about the research methods and more explicit informed consent among clients or customers.

Wider issues raised by the study and the backlash against it

The backlash and the above responses have furthermore provided fertile ground for other observations and arguments based on subtler issues and questions that the study and the response to it have revealed.

One of my favorites is the observation that IRBs do not perform ethical oversight. danah boyd argues that the IRB review process itself is mistakenly viewed by many as a mechanism for ensuring research is ethical. She makes an insightful, non-obvious argument: that the main function of an IRB is to ensure a university is not liable for the activities of a given research project, and that focusing on questions of IRB approval for the Facebook study is beside the point. Furthermore, the real source of the backlash for her is public misunderstanding of, and growing negative sentiment toward, the practice of collecting and analyzing data about people using the tools of big data.

Another point is that the ethical boundaries and considerations of industry and academia are difficult to reconcile. Ed Felten writes that though the study conforms to Facebook’s terms of service, it clearly is inconsistent with the research community’s ethical standards. On one hand, this gap could lead to fewer collaborations between companies and university researchers, while on the other hand it could enable some university researchers to side-step IRB requirements by working with companies. Note that opportunities for these sorts of collaborations often arise naturally and reasonably frequently; for example, it often happens that a professor’s student graduates and joins such a company, and they continue working together.

Zeynep Tufekci escalates the discussion to a much higher level—she argues that companies like Facebook are effectively engineering the public. According to Tufekci, this study isn’t the problem so much as it is symptomatic of the wider issue of how a corporate entity like Facebook has the power to target, model and manipulate users in very subtle ways. In a similar, though less polemical vein, Tarleton Gillespie notes the disconnect between Facebook’s promise to deliver a better experience to its users and how users perceive the role and ability of such algorithms. He notes that this leads to “a deeper discomfort about an information environment where the content is ours but the selection is theirs.”

In a follow up post responding to criticism of his “In Defense of Facebook” post, Tal Yarkoni points out that the real problem is the lack of regulations/frameworks for what can be done with online data, especially that collected by private entities like Facebook. He suggests the best thing is to reserve judgment with respect to questions of ethics for this particular paper, but that the incident does certainly highlight the need for “a new set of regulations that provide a unitary code for dealing with consumer data across the board–i.e., in both research and non-research contexts.”

Perhaps the most striking thing about the Kramer, Guillory and Hancock paper is how the ensuing discussion has highlighted many deep and important aspects of the ethics of research in computational social science from both industry and university perspectives, and the subtleties that lie therein.

Summing up

A standard blithe rejoinder to users of services like Facebook who express concern, or even horror, about studies like this is to say “Don’t you see that when you use a service you don’t pay for, you are not the customer, you are the product?” This is certainly true in many ways, and it merits repeating again and again. However, it of course doesn’t absolve corporations from the responsibility to treat their users with respect and regard for their well-being.

I don’t think the researchers or Facebook itself were grossly negligent with respect to this study, but the study is nonetheless in an ethical gray zone. Our second post will touch on other activities, such as A/B testing in ad placement and content, that are arguably in that same gray zone, but which have not created a public outcry even after years of being practiced. It will also say more about how the linguistic framing of the study itself essentially primed the extreme backlash that was observed, and how the study is in many ways more innocuous than its own wording would suggest.

Our third post will introduce our own opt-in version of the study, which we think is a reasonable way to explore the questions posed in the study. We’d love to get plenty of folks to try it out, and we’ll even let participants guess whether they were in the positive or negative group. Stay tuned!

There seems to be a relatively frequent back-and-forth in American society in which one group asks the wider society to stop using a racially charged word in certain contexts, and members of the wider society react to this as political correctness or think it is just plain wrong. For example, @KaraRBrown posted “Stop calling shit ‘ghetto’”, in which she calls out the increasing use of the word ‘ghetto’ as an adjective for less desirable stuff and strongly recommends people stop using it in that context. To summarize: “… this [using ghetto in this way] is something that can make a seemingly OK person immediately sound like an ignorant, possibly racist asshole. Don’t be that person.”

One response on Twitter to this remarked on how she must mean it is racist toward Jews, and this led to a non-debate with no improvement in mutual understanding. Ms Brown took it as trolling, but it is also indicative of responses I’ve seen elsewhere in similar discussions. I’m fine with viewing it as trolling if you are tired of that kind of typical response, but it can also be viewed as an expression of general misunderstanding about don’t-use-the-word-X-in-certain-contexts requests.

An example that I deal with all the time is the use of the word ‘slave’ in distributed computing contexts, where there is commonly a “master” compute node that is in charge of many “slave” worker compute nodes (e.g. look at systems like Hadoop and Spark). This terminology comes from general use of master and slave in technology. When I started working with Hadoop, I asked my wife (who is black) what she thought of that terminology, and she responded simply that she found it somewhat insensitive and offensive. Mainly, her response was just “Why? There are lots of other good descriptive words one could use instead.” I looked into it a bit, and it turns out there was a bit of a furor over master/slave terminology years ago when, in 2003, the County of Los Angeles requested that equipment suppliers avoid such terminology on equipment labels. The internet had a conniption about it, with many posters crying foul that this was political correctness gone crazy—even though it was just a polite email request. It is remarkable how vehemently offended some people got by the request and how they went to great lengths to defend the terms as the best ones possible. I’m personally with those who point out that there are many other perfectly good words to describe the relationship, e.g. primary/secondary or supervisor/worker, and that those have the added benefit of not being insensitive. My favorite response (which I unfortunately cannot find the link to now) was something like “we don’t call computer components rapist and victim: let’s not use master and slave either.”

One of the things that was often pointed out regarding master/slave is that the term goes back a long long time and that it neither began nor ended with American slavery, so why should black Americans be bothered by it? And anyway, slavery ended with the Civil War, so why can’t black Americans just get over it? It’s the same thing with the point about ‘ghetto’ being associated with Jews rather than black Americans. These comments ignore context, and the strong associations such terms have for some segments of American society. Context is everything, and unfortunately, American slavery is not out of context — it is the genesis of the struggles for equality that black Americans have faced over the past 150 years. Most white Americans feel it is far, far in the past, but it isn’t such a long time.  Oddly, many white Americans feel that we live in a post racial society, but this is at odds with the experience of many black Americans, and you don’t need to look far to see ugly examples of it right in our faces on Twitter.

Here’s another example of where context matters: a Boston policeman was fired for calling a baseball player a “Monday”. But “Monday” is just a day of the week, so what’s the big deal, right? Well, you know, regular words can be racist slurs in context. Consider this as well: it is still unfortunately common for white people born before the fifties to refer to black men as “boys”. This is highly offensive, even though they may wish no offense and often harbor no explicitly racist views. It’s the echo of times past reverberating through language still used today, and it still has power.

So, now to circle back to the main point. Some people seem to get quite offended by statements like “word X is racist in such-and-such context, so don’t use it that way.” Why? My guess is that quite often the offended person thinks “I’m not racist, but I’ve used that word in that way, so now you are calling me a racist, and that’s just crazy.” They then go on to justify that use of the word or otherwise make the request seem unreasonable. What they seem to be missing is that the original request is not saying that you are racist because you say X, but that it is racially insensitive to do so (and you probably didn’t realize that, so here’s your public service announcement). These are usually reasonable requests (and not calls to ban words, etc), so just consider changing your use of such terms out of respect and good sense.



My previous post showed how to use Twitter4j in Scala to access Twitter streams. This post shows how to control a Twitter user’s actions using Twitter4j. The primary purpose of this functionality is perhaps to create interfaces for Twitter like TweetDeck, but it can also be used to create bots that take automated actions on Twitter (one bot I’m playing around with is @tshrdlu, using the code in this tutorial and the code in the tshrdlu repository).

This post will only cover a small portion of the things you can do, but they are some of the more common things and I include a couple of simple but interesting use cases. Once you have these things in place, it is straightforward to figure out how to use the Twitter4j API docs (and Stack Overflow) to do the rest.

Getting set up: code and authorization

Rather than having the reader build the code up while going through the tutorial, I’ve set up the code in the repository twitter4j-tutorial. The version needed for this tutorial is v0.2.0. You can download a tarball of that version, which may be easier to work with if there have been further developments to the repository since the writing of this tutorial. Check out or download that code now. The main file of interest is:

  • src/main/scala/TwitterUser.scala

This tutorial is mainly a walk through for that file in blog form, with some additional pointers and explanations here and there.

You also need to set up the authorization details. See the “Setting up authorization” section of the previous post to do this if you haven’t already.


IMPORTANT: for this tutorial you must set the permissions for your application to be “Read and Write“. This does NOT mean to use ‘chmod’. It means going to the Twitter developers application site, signing in with your Twitter account, clicking on “Settings” and setting the permissions to read and write.


In the previous tutorial, authorization details were put into code. This time, we’ll use a properties file, which Twitter4j reads automatically from the current directory. This is easy: just add a file named to the twitter4j-tutorial directory with the following contents, substituting your details as appropriate.

oauth.consumerKey=[your consumer key here]
oauth.consumerSecret=[your consumer secret here]
oauth.accessToken=[your access token here]
oauth.accessTokenSecret=[your access token secret here]

Rate limits and a note of caution

Unlike streaming access to Twitter, performing user actions via the API is subject to rate limits. Once you hit your limit, Twitter will throw an exception and refuse to comply with your requests until a period of time has passed (usually 15 minutes). Twitter does this to limit bad bots and also to preserve its computational resources. For more information, see Twitter’s page about rate limiting.

I’ll discuss how to manage rate limits later in the post, but I mention them up front in case you exceed them while messing around with things early on.

A word of caution is also in order: since you are going to be able to take actions automatically, like following users, posting a status, and retweeting, you could end up doing many of these actions in rapid succession. This will (a) use up your rate limit very quickly, (b) probably not be interesting behavior, and (c) could get your account suspended. Make sure to follow the rules, especially those on following users.

If you are going to mess around quite a bit with actual posting, you may also want to consider creating an account that is not your primary Twitter account so that you don’t annoy your actual followers. (Suggestion: see the paragraph on “Create account” in part one of project phase one of my Applied NLP course for tips on how to add multiple accounts with the same gmail address.)

Basic interactions: searching, timelines, posting

All of the examples below are implemented as objects with main methods that do something using a twitter4j.Twitter object. So that we don’t have to call the TwitterFactory repeatedly, we first define a trait that gets a Twitter instance set up and ready to use.

trait TwitterInstance {
  val twitter = new TwitterFactory().getInstance
}

By extending this trait, our objects can access the twitter object conveniently.

As a first simple example, we can search for tweets that match a query by using the search method. The following object takes a query string given on the command line, searches for tweets using that query, and prints them.

object QuerySearch extends TwitterInstance {

  def main(args: Array[String]) {
    val statuses = Query(args(0))).getTweets
    statuses.foreach(status => println(status.getText + "\n"))
  }

}


Note that this uses a Query object, whereas when using a TwitterStream, a FilterQuery was needed. Also, for this to work, we must have the following import available:

import collection.JavaConversions._

This ensures that we can use the java.util.List returned by the getTweets method (of twitter4j.QueryResult) as if it were a Scala collection with the method foreach (and map, filter, etc). This is done via implicit conversions that make working with Java libraries far nicer than it would be otherwise.
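As a tiny aside, here is a self-contained illustration (mine, not part of the tutorial code) of what that conversion buys us, with a plain Java list standing in for the QueryResult:

import collection.JavaConversions._

object ConversionDemo {
  def main(args: Array[String]) {
    // The java.util.List gains Scala methods like filter and foreach implicitly.
    val javaList: java.util.List[String] = java.util.Arrays.asList("a", "b", "c")
    javaList.filter(_ != "b").foreach(println) // prints "a" then "c"
  }
}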

To run this, go to the twitter4j-tutorial directory, and do the following (some example output shown):

$ ./build
> run-main bcomposes.twitter.QuerySearch scala
[info] Running bcomposes.twitter.QuerySearch scala
E' avvilente non sentirsi all'altezza di qualcosa o qualcuno, se non si possiede quella scala interiore sulla quale l'autostima può issarsi

Scala workshop will run with ECOOP, July 2nd in Montpellier, South of France. Call for papers is out.

#scala Even two of them in #cologne #germany . #thumbsup

RT @MILLIB2DAL: @djcameo Birthday bash 30th march @ Scala nightclub 100 artists including myself make sur u reach its gonna be #Legendary

@kot_2010 I think it's the same case with Scala: with macros it will tend to "outsource" things to macro libs, keeping a small lang core.

RT @waxzce: #scala hiring or job ? go there :

@esten That's not only a front-end problem. Scala devs should use scalaz.Equal and === for type safe equality. /cc @sharonw


[success] Total time: 1 s, completed Feb 26, 2013 1:54:44 PM

You might see some extra communications from SBT, which will probably need to download dependencies and compile the code. For the rest of the examples below, you can run them in a similar manner, substituting the right object name and providing any necessary arguments.

There are various timelines available for each user, including the home timeline, mentions timeline, and user timeline. They are accessible as twitter4j.api.TimelinesResources. For example, the following object shows the most recent statuses on the authenticating user’s home timeline (which are the tweets by people the user follows).

object GetHomeTimeline extends TwitterInstance {

  def main(args: Array[String]) {
    val num = if (args.length == 1) args(0).toInt else 10
    val statuses = twitter.getHomeTimeline.take(num)
    statuses.foreach(status => println(status.getText + "\n"))
  }

}


The number of tweets to show is given as the command-line argument.

You can also update the status of the authenticating user from the command line using the following object. Calling it will post to the authenticating user’s account (so only do it if you are comfortable with the command-line argument you give it going onto your timeline).

object UpdateStatus extends TwitterInstance {
  def main(args: Array[String]) {
    twitter.updateStatus(new StatusUpdate(args(0)))
  }
}

There are plenty of other useful methods that you can use to interact with Twitter, and if you have successfully run the above three, you should be able to look at the Twitter4j javadocs and start using them. Some examples doing more interesting things are given below.

Replying to tweets written to you

The following object goes through the most recent tweets that have mentioned the authenticating user, and replies “OK.” to them. It includes the author of the original tweet and any other entities that were mentioned in it.

object ReplyOK extends TwitterInstance {

  def main(args: Array[String]) {
    val num = if (args.length == 1) args(0).toInt else 10
    val userName = twitter.getScreenName
    val statuses = twitter.getMentionsTimeline.take(num)
    statuses.foreach { status => {
      val statusAuthor = status.getUser.getScreenName
      val mentionedEntities =
      val participants = (statusAuthor :: mentionedEntities).toSet - userName
      val text =>"@"+p).mkString(" ") + " OK."
      val reply = new StatusUpdate(text).inReplyToStatusId(status.getId)
      println("Replying: " + text)
      twitter.updateStatus(reply)
    }}
  }

}

This should be mostly self-explanatory, but there are a couple of things to note. First, you can find all the entities that have been mentioned (via @-mentions) in the tweet via the method getUserMentionEntities of the twitter4j.Status class. The code ensures that the author of the original tweet (who isn’t necessarily mentioned in it) is included as a participant for the response, and also we take out the authenticating user. So, if the message “@tshrdlu What do you think of @tshrdlc?” is sent from @jasonbaldridge, the response will be “@jasonbaldridge @tshrdlc OK.” Note how the screen names do not have the @ symbol, so that must be added in the tweet text of the reply.

Second, notice that StatusUpdate objects can be created by chaining methods that add more information to them, e.g. setInReplyToStatusId and setLocation, which incrementally build up the StatusUpdate object that gets actually posted. (This is a common Java strategy that basically helps get around the fact that parameters to classes can neither be specified by name in Java nor have defaults, the way they can in Scala.)
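For instance, a reply that also attaches coordinates could be built as follows (a sketch with a made-up status id and coordinates; location is the fluent counterpart of setLocation):

object ReplyWithLocation extends TwitterInstance {
  def main(args: Array[String]) {
    // Hypothetical values, purely for illustration.
    val update = new StatusUpdate("@someuser OK.")
      .inReplyToStatusId(12345L)
      .location(new GeoLocation(30.2672, -97.7431))
    twitter.updateStatus(update)
  }
}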

Checking and managing rate limit information

None of the above code makes many requests from Twitter, so there was little danger of exceeding rate limits. These limits are a mixture of both time and number of requests: you basically get a certain number of requests every hour (currently 350) per authenticating user. Because of these limits, you should consider accessing tweets, timelines, and such using the streaming methods when you can.

Every response you get from Twitter comes back as a sub-class of twitter4j.TwitterResponse, which not only gives you what you want (like a QueryResult) but also gives you information about your connection to Twitter. For rate limit information, you can use the getRateLimitStatus method, which can then inform you about the number of requests you can still make and the time until your limit resets.

The trait RateChecker below has a function checkAndWait that, when given a TwitterResponse object, checks whether the rate limit has been exceeded and waits if it has. When the rate limit is exceeded, it finds out how much time remains until the limit resets and makes the thread sleep until that time (plus 10 seconds) has passed.

trait RateChecker {

  def checkAndWait(response: TwitterResponse, verbose: Boolean = false) {
    val rateLimitStatus = response.getRateLimitStatus
    if (verbose) println("RLS: " + rateLimitStatus)

    if (rateLimitStatus != null && rateLimitStatus.getRemaining == 0) {
      println("*** You hit your rate limit. ***")
      val waitTime = rateLimitStatus.getSecondsUntilReset + 10
      println("Waiting " + waitTime + " seconds ( " + waitTime/60.0 + " minutes) for rate limit reset.")
      Thread.sleep(waitTime * 1000)
    }
  }

}

Managing rate limits is actually more complex than this. For example, this strategy ignores the fact that different request types have different limits, but it keeps things simple. This is surely not an optimal solution, but it does the trick for present purposes.

Note also that you can directly ask for rate limit information from the twitter4j.Twitter instance itself, using the getRateLimitStatus method. Unlike the results for the same method on a TwitterResponse, this gives a Map from various request types to the current rate limit statuses for each one. In a real application, you’d want to control each of these different limits at a more fine-grained level using this information.
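Here is a small sketch (mine, not from the tutorial file) that dumps that map so you can see the per-request-type limits:

import collection.JavaConversions._

object ShowRateLimits extends TwitterInstance {
  def main(args: Array[String]) {
    // Keys name the request types, e.g. "/search/tweets".
    twitter.getRateLimitStatus.foreach { case (endpoint, status) =>
      println(endpoint + ": " + status.getRemaining + "/" + status.getLimit +
        ", resets in " + status.getSecondsUntilReset + "s")
    }
  }
}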

Not all of the methods of Twitter4j classes actually hit the Twitter API. To see whether a given method does, look at its Javadoc: if its description says “This method calls”, then it does hit the API. Otherwise, it doesn’t and you don’t need to guard it.

Examples using the checkAndWait function are given below.

Creating a word cloud from followers’ descriptions

Here’s a more interesting task: given a Twitter user, compute the counts of the words in the bio descriptions of that user’s followers and build a word cloud from them. The following code does this, outputting the resulting counts in a file, the contents of which can be pasted into Wordle’s advanced word cloud input.

import{BufferedWriter, FileWriter}

object DescribeFollowers extends TwitterInstance with RateChecker {

  def main(args: Array[String]) {
    val screenName = args(0)
    val maxUsers = if (args.length==2) args(1).toInt else 500
    val followerIds = twitter.getFollowersIDs(screenName,-1).getIDs

    val descriptions = followerIds.take(maxUsers).flatMap { id => {
      val user = twitter.showUser(id)
      checkAndWait(user)
      if (user.isProtected) None else Some(user.getDescription)
    }}

    val tword = """(?i)[a-z#@]+""".r.pattern
    val words = descriptions.flatMap(_.toLowerCase.split("\\s+"))
    val filtered = words.filter(_.length > 3).filter(tword.matcher(_).matches)
    val counts = filtered.groupBy(x=>x).mapValues(_.length)
    val rankedCounts = counts.toSeq.sortBy(- _._2)

    val wordcountFile = "/tmp/follower_wordcount.txt"
    val writer = new BufferedWriter(new FileWriter(wordcountFile))
    for ((w,c) <- rankedCounts)
      writer.write(w + ":" + c + "\n")
    writer.flush
    writer.close
  }

}

The thing to consider is that if you are pointing this at a person with several hundred followers, you will exceed the rate limit. The call to getFollowersIDs is a single hit, and then each call to showUser is a hit. Because the showUser calls come in rapid succession, we check the rate limit status after each one using checkAndWait (which is available because we mixed in the RateChecker trait) and it waits for the limit to reset as previously discussed, keeping us from exceeding the rate limit and getting an exception from Twitter.

The number of users returned by getFollowersIDs is at most 5000. If you run this on a user who has more followers, followers beyond 5000 won’t be considered. If you want to tackle such a user, you’ll need to use the cursor, which is the integer provided as the argument to getFollowersIDs, and make multiple calls, passing along the next cursor each time, to get them all.
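Here is a sketch of that cursor loop (my own helper, not in the tutorial code); it assumes it lives in an object that mixes in TwitterInstance, and for real use you would also call checkAndWait on each response:

// Gather all follower IDs by walking the cursor: -1 starts the paging,
// and IDs.hasNext/getNextCursor drive the subsequent calls.
def allFollowerIds(screenName: String): Vector[Long] = {
  def loop(cursor: Long, acc: Vector[Long]): Vector[Long] = {
    val ids = twitter.getFollowersIDs(screenName, cursor)
    val collected = acc ++ ids.getIDs
    if (ids.hasNext) loop(ids.getNextCursor, collected) else collected
  }
  loop(-1L, Vector.empty)
}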

Most of the rest of the code is just standard Scala stuff for getting the word counts and outputting them to a file. Note that a small effort is made to filter out non-alphabetic characters (while allowing # and @) and to drop short words.

As an example of the output, when put into Wordle, here is the word cloud for my followers.


This looks about right for me—completely expected in fact—but it is still cool that it comes out of my followers’ self descriptions. One could start thinking of some fun algorithms for exploiting this kind of representation of a user to look into how well different users align or don’t align with their followers, or to look for clusters of different types of followers, etc.

Retweeting automatically

Tired of actually reading those tweets in your timeline and retweeting some of them? The following code gets some of the accounts the authenticating user follows, grabs twenty of those users, filters them to get interesting ones, and then takes up to 10 of the remaining ones and retweets their most recent statuses (provided they aren’t replies to someone else).

object RetweetFriends extends TwitterInstance with RateChecker {

  def main(args: Array[String]) {
    val friendIds = twitter.getFriendsIDs(-1).getIDs
    val friends = friendIds.take(20).map { id => {
      val user = twitter.showUser(id)
      checkAndWait(user)
      user
    }}

    val filtered = friends.filter(admissable)
    val ranked = => (f.getFollowersCount, f)).sortBy(- _._1).map(_._2)

    ranked.take(10).foreach { friend => {
      val status = friend.getStatus
      if (status != null && status.getInReplyToStatusId == -1) {
        println("\nRetweeting " + friend.getName + ":\n" + status.getText)
        twitter.retweetStatus(status.getId)
        Thread.sleep(30000)
      }
    }}
  }

  def admissable(user: User) = {
    val ratio = user.getFollowersCount.toDouble/user.getFriendsCount
    user.getFriendsCount < 1000 && ratio > 0.5
  }

}

The getFriendsIDs method is used to get the users that the authenticating user is following (but who do not necessarily follow the authenticating user, despite the use of the word “friend”). We again take care with rate limiting while gathering the users. We filter these users, looking for those who follow fewer than 1000 users and who have a follower/friend ratio greater than 0.5, in a simple attempt to filter out some less interesting (or spammy) accounts. The remaining users are then ranked according to their number of followers (most first). Finally, we take (up to) 10 of these (the take method simply returns as many as are available if you ask for 10 and there are only, say, 3), look at their most recent status, and if it is not null and isn’t a reply to someone, we retweet it. Between each of these, we wait for 30 seconds so that anyone following our account doesn’t get an avalanche of retweets.


This post and the related code should provide enough to get a decent feel for working with Twitter4j, including necessary setup and using some of the methods to start creating applications with it in Scala. See project phase three of my Applied NLP course to see exercises and code that takes this further to do interesting things for automated bots, including mixing streaming access and user access to get more complex behaviors.

