Talk about coming out of the closet: Amina Abdallah Arraf, the outspoken lesbian Muslim blogger from Damascus who was supposedly kidnapped by Syrian security forces, turns out to be 40-year-old Tom MacMaster, a straight American man living in Scotland. In a bizarre coincidence that has surely disappointed straight guys all over the world, the lesbian blogger Paula Brooks, who had flirted with Amina and unwittingly helped to spread MacMaster’s hoax via the blog LezGetReal, also turned out to be a man, the straight (and married) Bill Graber.


Girlfriends “Amina Abdallah Arraf” and “Paula Brooks”
The manifold social, political, journalistic, blogospheric, and other implications have been, and will continue to be, discussed ad nauseam, but what I’d like to ask, as a techie, is:
Could these deceptions have been discovered before so many people were sucked into them?
If we could have known somehow that Amina’s blog posts were likely written by a man, suspicion would have arisen sooner, and less damage would have been done.
As it turns out, recent advances in automated text analytics can help do just that.
Now, everyone knows that people from different backgrounds tend to speak and write a little bit differently. You’d expect someone from New Orleans not to sound like someone from London’s East End. A bit more surprising is the fact that men and women, even from similar backgrounds, speak and write differently. Though the differences are small (slight tendencies to prefer certain words or sentence structures to others), modern data mining algorithms can predict fairly accurately whether the author of an anonymous text was male or female, and do so noticeably better than people can. (Yes, V. S. Naipaul, the Nobel Prize-winning author, has said that he can tell within two paragraphs if a book was written by a woman, but his claim has not yet been scientifically tested, to say the least.)
The New Scientist recently reported how researchers at Stevens Institute of Technology have applied text mining to the “Gay Girl in Damascus” blog, showing that beneath the surface rhetoric of a lesbian woman lurked textual features more typical of a male’s writing (as indeed it was). Experiments using our Identitext system confirmed this result, and also showed that Bill Graber’s posts on LezGetReal appear more male than female. (If only these tests had been done earlier!)
Of course, outing hoaxer lesbian bloggers as straight men is not exactly a growth industry (let’s hope not!). But there are plenty of mainstream applications for this new technology, as it matures. Automatically analyzing the vast contents of online conversations is becoming more and more central to market research and business intelligence, and so knowing the demographics of the writers is becoming quite important. Such demographic profiling also offers the promise of more focused generation of leads for online businesses, or for understanding better what sorts of people read and comment on your blog or content website. And this is not to mention the potential applications in counter-terrorism and criminal investigations; these new computerized techniques will soon be standard tools in the arsenal of the intelligence analyst and the forensic linguist.

Contentious robot fish of the Bible
…or, some highlights from the 2011 Conference of the Association for Computational Linguistics
First, a word from our curmudgeon: Far too many of the papers, unfortunately as usual, could be summarized as, “We applied several variations of a new complicated method, and one of the runs gave 1% improved results over the simple method that’s been around for 15 years.” Talks with new ideas were therefore very refreshing! I’ll talk about a few of these below (with more, hopefully, to follow in future posts).
Identifying opposing sides in political debate from news articles
So much of sentiment analysis research over the last few years has amounted, more or less, to “how can we improve classification of positive and negative movie reviews by a couple of percent by adding features to bag-of-words?” None of these techniques add any new insights into the problem, nor do they improve results in any real way. The vein of incremental progress was exhausted years ago; advances now require new ways of looking at the problem.
A new view of sentiment analysis, reminding us that opinions have to do with people, was offered by Souneil Park, Kyung Soon Lee, and Junehwa Song in their paper Contrasting Opposing Views of News Articles on Contentious Issues, which described an approach not to classifying articles according to sentiment (which is usually not even a well-defined task), but to identifying the key figures on opposite sides of (political) debates, by analyzing newspaper text. The key idea is that figures (people or organizations) on each side will tend to attack figures on the other side, so if we can identify enough such attacks, we can automatically discover what the sides are (and who is on each). Sentiment analysis on sentences such as, “The government defined that the attack of North Korea is an act of invasion,” can (roughly) classify if one actor (“the government”) is saying positive or negative things (in this case, negative: “attack”, “invasion”) about another (“North Korea”).
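To make the "sides attack each other" idea concrete, here is a minimal sketch. It is not the authors' method: it assumes the attacks have already been extracted as (attacker, target) pairs, and it simply two-colors the resulting graph so that each attacker lands on the opposite side from its target. The function name `split_into_sides` and the actors "A" through "D" are made up for illustration.

```python
from collections import defaultdict, deque

def split_into_sides(attacks):
    """Assign each actor to side 0 or 1 so attackers and targets differ.

    Simplifying assumption: every attack links two opposing actors.
    Contradictory evidence (e.g. an odd cycle of attacks) is ignored;
    the first assignment an actor receives is kept.
    """
    opponents = defaultdict(set)
    for attacker, target in attacks:
        opponents[attacker].add(target)
        opponents[target].add(attacker)

    side = {}
    for start in sorted(opponents):          # sorted for repeatable output
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:                         # breadth-first two-coloring
            actor = queue.popleft()
            for other in sorted(opponents[actor]):
                if other not in side:
                    side[other] = 1 - side[actor]
                    queue.append(other)
    return side

# Hypothetical attack pairs, standing in for sentiment-analysis output.
toy_attacks = [("A", "B"), ("C", "B"), ("B", "A"), ("D", "A")]
sides = split_into_sides(toy_attacks)
print(sides)
```

In this toy input, A and C both attack B, so they end up together, while B and D (who attack A) form the other side. A real system would have to cope with noisy, conflicting attack evidence, which this sketch deliberately does not.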
“As cold as a robot fish” — from n-grams to linguistic creativity
Tony Veale, of University College Dublin, presented a fascinating exploration of how massive, but simple, language data can be mined for Creative Language Retrieval. By searching for certain kinds of fixed word patterns, pairs of stereotypically related terms can be found, which can then be used in creative ways. For example, because “as cold as a fish” is common, the system knows that fish are stereotypically cold; because “as cold as a robot” is also common, it knows that robots are stereotypically cold; and because the phrase “robot fish” occurs more often than one would expect by chance, it knows that there is such a thing as a robot fish. The system can then create the novel (and evocative) simile, “as cold as a robot fish”. Novel metaphors can also be created: “he is a wise professor” can be transformed into “he is a wise old owl”. For some very entertaining online demos of this technology, see Educated Insolence, where the demos include Idiom Savant, The Jigsaw Bard, and Aristotle, among others. This site is as exciting as the lifestyle of a princess (as the Jigsaw Bard puts it).
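The robot-fish recipe is simple enough to sketch in a few lines. What follows is only a toy illustration, not Veale's system: the phrase counts are invented, the "more often than chance" test is replaced by a plain lookup in a set of attested compounds, and the names `stereotypes` and `novel_similes` are my own.

```python
import re
from collections import defaultdict
from itertools import permutations

# Matches the fixed pattern "as ADJ as a/an NOUN".
SIMILE = re.compile(r"^as (\w+) as an? (\w+)$")

def stereotypes(phrase_counts, min_count):
    """Map each adjective to the nouns it stereotypically describes."""
    nouns_by_adj = defaultdict(set)
    for phrase, count in phrase_counts.items():
        match = SIMILE.match(phrase)
        if match and count >= min_count:     # ignore rare, noisy similes
            nouns_by_adj[match.group(1)].add(match.group(2))
    return nouns_by_adj

def novel_similes(phrase_counts, compounds, min_count=2):
    """Combine two nouns sharing an adjective if their compound is attested."""
    results = []
    for adj, nouns in sorted(stereotypes(phrase_counts, min_count).items()):
        for first, second in permutations(sorted(nouns), 2):
            if f"{first} {second}" in compounds:
                results.append(f"as {adj} as a {first} {second}")
    return results

# Invented counts standing in for a large n-gram collection.
toy_phrases = {
    "as cold as a fish": 50,
    "as cold as a robot": 30,
    "as wise as an owl": 40,
    "as cold as a xylophone": 1,   # too rare to count as a stereotype
}
toy_compounds = {"robot fish"}     # assumed to pass a "better than chance" test

print(novel_similes(toy_phrases, toy_compounds))
```

The interesting design point is that nothing here "understands" anything: the creativity comes entirely from recombining patterns that are each individually well attested.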
Automatic source criticism of the Hebrew Scriptures
For centuries Bible scholars have attempted to identify distinct authorial voices within books of the Bible, though their methods have often been criticized as subjective and even impressionistic. Moshe Koppel, Navot Akiva, Nachum Dershowitz, and Idan Dershowitz have devised an algorithm that automatically divides a document into distinct strands by detecting which parts of the text make different choices among available synonyms. For example, some parts of the text will consistently use the Hebrew word makel (meaning “staff”), while others will consistently use mateh (also meaning “staff”). By formalizing and generalizing this phenomenon, the researchers showed that when two Biblical books, such as Jeremiah and Ezekiel, are randomly mixed together, the merged book can be automatically, and almost perfectly, separated back into its constituent components.
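For the curious, here is a rough sketch of the synonym-choice intuition. It is emphatically not the researchers' algorithm: it assumes the synonym pairs are supplied by hand, it splits into exactly two strands, and apart from makel and mateh the tokens (`word_a`, `word_b`) are placeholders I made up. Each chunk is scored by which member of each pair it uses, and chunks are grouped by agreement with a running profile seeded from the first chunk.

```python
def choice_vector(tokens, synonym_pairs):
    """+1 if the chunk uses the first synonym, -1 for the second, 0 if neither/both."""
    return [(first in tokens) - (second in tokens)
            for first, second in synonym_pairs]

def separate(chunks, synonym_pairs, rounds=5):
    """Label each chunk 0 or 1 according to its synonym preferences."""
    vectors = [choice_vector(set(chunk), synonym_pairs) for chunk in chunks]
    profile = list(vectors[0])               # seed: preferences of the first chunk
    labels = [0] * len(vectors)
    for _ in range(rounds):
        # Chunks agreeing with the profile go to strand 0 (ties also go to 0).
        labels = [0 if sum(p * v for p, v in zip(profile, vec)) >= 0 else 1
                  for vec in vectors]
        # Re-estimate the profile from the current split.
        profile = [sum(vec[i] if label == 0 else -vec[i]
                       for vec, label in zip(vectors, labels))
                   for i in range(len(synonym_pairs))]
    return labels

# Hand-supplied pairs; only makel/mateh come from the research described above.
toy_pairs = [("makel", "mateh"), ("word_a", "word_b")]
toy_chunks = [
    ["makel", "word_a"],
    ["mateh", "word_b"],
    ["makel"],
    ["word_b"],
    ["word_a", "makel"],
    ["mateh"],
]
print(separate(toy_chunks, toy_pairs))
```

Notice that the fourth chunk never mentions either word for "staff", yet it still gets grouped with the mateh chunks because it shares `word_b` with one of them. That kind of indirect evidence is what makes generalizing beyond a single synonym pair worthwhile.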
UPDATE: See this AP story on this research.