You can fake, but you can’t hide

Talk about coming out of the closet — Amina Abdallah Arraf, the outspoken lesbian Muslim blogger from Damascus who was kidnapped by Syrian security forces, turns out to be 40-year-old Tom MacMaster, a straight American man living in Scotland. In a bizarre coincidence that has surely disappointed straight guys all over the world, the lesbian blogger Paula Brooks who had flirted with Amina and unwittingly helped to spread MacMaster’s hoax via the blog LezGetReal also turned out to be a man, the straight (and married) Bill Graber.




Girlfriends “Amina Abdallah Arraf” and “Paula Brooks”

The manifold social, political, journalistic, blogospheric, and other implications have been, and will continue to be, discussed ad nauseam, but what I’d like to ask, as a techie, is:

Could these deceptions have been discovered before so many people were sucked into them?

If we could have known somehow that Amina’s blog posts were likely written by a man, suspicion would have arisen sooner, and less damage would have been done.

As it turns out, recent advances in automated text analytics can help do just that.

Now, everyone knows that people from different backgrounds tend to speak and write a little bit differently. You’d expect someone from New Orleans not to sound like someone from London’s East End. A bit more surprising is the fact that men and women, even from similar backgrounds, speak and write differently. Though the differences are small (slight tendencies to prefer certain words or sentence structures to others), modern data mining algorithms can predict fairly accurately whether the author of an anonymous text was male or female, and do so noticeably better than people can. (Yes, V. S. Naipaul, the Nobel Prize-winning author, has said that he can tell within two paragraphs if a book was written by a woman, but his claim has not yet been scientifically tested, to say the least.)
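For the techies in the audience, here is a rough idea of how "slight tendencies to prefer certain words" can be turned into a prediction. This is only a minimal sketch, and it is not the method used by any of the systems mentioned in this post: it assumes a tiny, hypothetical list of function words as features and a plain Naive Bayes classifier with add-one smoothing. The word list and the function names (features, train, predict) are illustrative choices of mine. A real profiler would learn from thousands of documents whose authors are known and would use far richer features.

```python
import math
import re
from collections import Counter

# Hypothetical feature set: a handful of common English function words.
# A real system would use many more features, selected from training data.
FUNCTION_WORDS = ["the", "a", "of", "and", "i", "you",
                  "she", "he", "with", "for", "not", "is"]


def features(text):
    """Count how often each tracked function word occurs in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t in FUNCTION_WORDS)


def train(labeled_texts):
    """Fit a multinomial Naive Bayes model with add-one smoothing.

    labeled_texts: iterable of (label, text) pairs.
    Returns {label: (log_prior, {word: log_likelihood})}.
    """
    word_counts = {}
    doc_counts = Counter()
    for label, text in labeled_texts:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(features(text))

    total_docs = sum(doc_counts.values())
    model = {}
    for label, counts in word_counts.items():
        # Add-one smoothing so an unseen word never gets probability zero.
        denom = sum(counts.values()) + len(FUNCTION_WORDS)
        log_like = {w: math.log((counts[w] + 1) / denom)
                    for w in FUNCTION_WORDS}
        model[label] = (math.log(doc_counts[label] / total_docs), log_like)
    return model


def predict(model, text):
    """Return the label whose word-usage profile best matches the text."""
    counts = features(text)
    scores = {
        label: log_prior + sum(n * log_like[w] for w, n in counts.items())
        for label, (log_prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)
```

To use it, you would call train on (label, text) pairs taken from writing whose authorship is known, then call predict on the anonymous text. The quality of the answer depends entirely on the training data: with only a few examples, or with examples that differ in topic rather than in author, the output means very little.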

New Scientist recently reported how researchers at Stevens Institute of Technology have applied text mining to the “Gay Girl in Damascus” blog, showing that beneath the surface rhetoric of a lesbian woman lurked textual features more typical of a man’s writing (as indeed it was). Experiments using our Identitext system confirmed this result, and also showed that Bill Graber’s posts on LezGetReal appear more male than female. (If only these tests had been done earlier!)

Of course, outing hoaxer lesbian bloggers as straight men is not exactly a growth industry (let’s hope not!). But there are plenty of mainstream applications for this new technology, as it matures. Automatically analyzing the vast contents of online conversations is becoming more and more central to market research and business intelligence, and so knowing the demographics of the writers is becoming quite important. Such demographic profiling also offers the promise of more focused generation of leads for online businesses, or for understanding better what sorts of people read and comment on your blog or content website. And this is not to mention the potential applications in counter-terrorism and criminal investigations; these new computerized techniques will soon be standard tools in the arsenal of the intelligence analyst and the forensic linguist.

