Contentious robot fish of the Bible

…or, some highlights from the 2011 Conference of the Association for Computational Linguistics

First, a word from our curmudgeon: Far too many of the papers, unfortunately as usual, could be summarized as, “We applied several variations of a new complicated method, and one of the runs gave 1% improved results over the simple method that’s been around for 15 years.” Talks with new ideas were therefore very refreshing! I’ll talk about a few of these below (more to hopefully follow in future posts).

Identifying opposing sides in political debate from news articles

So much of sentiment analysis research over the last few years has amounted, more or less, to “how can we improve classification of positive and negative movie reviews by a couple of percent by adding features to bag-of-words?” None of these techniques add any new insights into the problem, nor do they improve results in any real way. The vein of incremental progress was exhausted years ago; advances now require new ways of looking at the problem.

A new view of sentiment analysis, reminding us that opinions have to do with people, was offered by Souneil Park, Kyung Soon Lee, and Junehwa Song in their paper Contrasting Opposing Views of News Articles on Contentious Issues, which described an approach not to classifying articles according to sentiment (which is usually not even a well-defined task), but to identifying the key figures on opposite sides of (political) debates, by analyzing newspaper text. The key idea is that figures (people or organizations) on each side will tend to attack figures on the other side, so if we can identify enough such attacks, we can automatically discover what the sides are (and who is on each). Sentiment analysis on sentences such as, “The government defined that the attack of North Korea is an act of invasion,” can (roughly) classify whether one actor (“the government”) is saying positive or negative things (in this case, negative: “attack”, “invasion”) about another (“North Korea”).
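
To make the mechanism concrete, here is a toy sketch of how such attack/support signals could be turned into sides. This is my own illustration, not the authors' algorithm, and the actors and statements are invented: each extracted statement becomes a signed edge between actors, and a greedy pass splits the actors into two camps so that attacks tend to cross the divide.

    from collections import defaultdict

    # (source, target, polarity) triples, as might come from sentence-level
    # sentiment analysis; +1 = supports/praises, -1 = attacks/criticizes.
    statements = [
        ("government", "north_korea", -1),
        ("north_korea", "government", -1),
        ("ruling_party", "government", +1),
        ("opposition", "government", -1),
    ]

    score = defaultdict(int)  # net sentiment for each unordered actor pair
    for src, tgt, pol in statements:
        score[frozenset((src, tgt))] += pol

    actors = {a for s, t, _ in statements for a in (s, t)}
    side_a, side_b = {next(iter(actors))}, set()
    for actor in actors - side_a:
        # Put each actor on the side it supports most / attacks least.
        # (A greedy, order-dependent pass; real work would optimize globally.)
        affinity_a = sum(score[frozenset((actor, o))] for o in side_a)
        affinity_b = sum(score[frozenset((actor, o))] for o in side_b)
        (side_a if affinity_a >= affinity_b else side_b).add(actor)

    print("Side A:", side_a)
    print("Side B:", side_b)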

“As cold as a robot fish” — from n-grams to linguistic creativity

Tony Veale, of University College Dublin, presented a fascinating exploration of how massive, but simple, language data can be mined for Creative Language Retrieval. By searching for certain kinds of fixed word patterns, pairs of stereotypically related terms can be found, which can then be used in creative ways. For example, since “as cold as a fish” is common, the system knows that fish are stereotypically cold; since “as cold as a robot” is also common, the system knows that robots are stereotypically cold; and finally, since the phrase “robot fish” occurs more often than one would expect by chance, there is such a thing as a robot fish. The system can then create the novel (and evocative) simile, “as cold as a robot fish”. Novel metaphors can also be created: “he is a wise professor” can be transformed into “he is a wise old owl”. To see some very entertaining online demos of this technology, visit Educated Insolence, where the demos include Idiom Savant, The Jigsaw Bard, and Aristotle, among others. This site is as exciting as the lifestyle of a princess (as the Jigsaw Bard puts it).
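
For flavor, here is a toy sketch of the kind of pattern mining involved. The counts and thresholds are invented for illustration; the real system works over web-scale n-gram data and uses proper tests of statistical surprise rather than a raw count cutoff.

    # Toy phrase counts, standing in for web-scale n-gram statistics.
    phrase_counts = {
        "as cold as a fish": 920,
        "as cold as a robot": 310,
        "as cold as ice": 5400,
        "robot fish": 85,  # the compound itself is attested
        "ice fish": 2,
    }

    def stereotype_nouns(adjective, counts, min_count=100):
        """Nouns stereotypically linked to an adjective via 'as ADJ as a NOUN'."""
        prefix = "as %s as a " % adjective
        return {p[len(prefix):] for p, c in counts.items()
                if p.startswith(prefix) and c >= min_count}

    def novel_similes(adjective, counts, min_compound=50):
        nouns = stereotype_nouns(adjective, counts)
        for head in nouns:
            for modifier in nouns - {head}:
                # Keep the pairing only if the compound is itself attested
                # (the real test compares against chance expectation).
                if counts.get("%s %s" % (modifier, head), 0) >= min_compound:
                    yield "as %s as a %s %s" % (adjective, modifier, head)

    print(list(novel_similes("cold", phrase_counts)))
    # -> ['as cold as a robot fish']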

Automatic source criticism of the Hebrew Scriptures

For centuries Bible scholars have attempted to identify distinct authorial voices within books of the Bible, though their methods have often been criticized as subjective and even impressionistic. Moshe Koppel, Navot Akiva, Nachum Dershowitz, and Idan Dershowitz have devised an algorithm that automatically divides a document into distinct strands by detecting which parts of the text make different choices among available synonyms. For example, some parts of the text will consistently use the Hebrew word makel (meaning “staff”), while others will consistently use mateh (also meaning “staff”). By formalizing and generalizing this phenomenon, the researchers showed that when two Biblical books, such as Jeremiah and Ezekiel, are randomly mixed together, the merged book could be almost perfectly separated back into its constituent components.
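
Here is a greatly simplified sketch of the synonym-choice idea; it is not the authors' actual method (which is considerably more sophisticated), but it shows the core representation: describe each chunk of text by its choices among synonym pairs, then cluster chunks whose choices agree.

    import numpy as np
    from sklearn.cluster import KMeans

    # One illustrative pair; a real analysis uses many synonym classes.
    SYNONYM_PAIRS = [("makel", "mateh")]  # both mean "staff"

    def choice_vector(chunk, pairs=SYNONYM_PAIRS):
        """+1 if the chunk prefers the first synonym, -1 the second, 0 if neither."""
        return [int(np.sign(chunk.count(a) - chunk.count(b))) for a, b in pairs]

    chunks = [
        "... makel ... makel ...",   # consistently uses makel
        "... mateh ...",             # consistently uses mateh
        "... makel ...",
        "... mateh ... mateh ...",
    ]
    X = np.array([choice_vector(c) for c in chunks])
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
    print(labels)  # chunks sharing a label share synonym preferences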

UPDATE: See this AP story on this research.


Posted in Commentary, Computational Linguistics, Science | Leave a comment

You can fake, but you can’t hide

Talk about coming out of the closet — Amina Abdallah Arraf, the outspoken lesbian Muslim blogger from Damascus who was kidnapped by Syrian security forces, turns out to be 40-year-old Tom MacMaster, a straight American man living in Scotland. In a bizarre coincidence that has surely disappointed straight guys all over the world, the lesbian blogger Paula Brooks who had flirted with Amina and unwittingly helped to spread MacMaster’s hoax via the blog LezGetReal also turned out to be a man, the straight (and married) Bill Graber.

Girlfriends “Amina Abdallah Arraf” and “Paula Brooks”

The manifold social, political, journalistic, blogospheric, and other implications have been, and will continue to be, discussed ad nauseam, but what I’d like to ask, as a techie, is:

Could these deceptions have been discovered before so many people were sucked into them?

If we could have known somehow that Amina’s blog posts were likely written by a man, suspicion would have arisen sooner, and less damage would have been done.

As it turns out, recent advances in automated text analytics can help do just that.

Now, everyone knows that people from different backgrounds tend to speak and write a little bit differently. You’d expect someone from New Orleans not to sound like someone from London’s East End. A bit more surprising is the fact that men and women, even from similar backgrounds, speak and write differently. Though the differences are small (slight tendencies to prefer certain words or sentence structures over others), modern data-mining algorithms can predict fairly accurately whether the author of an anonymous text was male or female, and do so noticeably better than people can. (Yes, V. S. Naipaul, the Nobel Prize-winning author, has said that he can tell within two paragraphs whether a book was written by a woman, but his claim has not yet been scientifically tested, to say the least.)
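
For the curious, here is a bare-bones sketch of how such a classifier can be built, with placeholder data; this is a generic illustration, not the method of any particular published system. Character n-grams, which pick up function words and morphology, plus a simple linear model go a surprisingly long way.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Placeholder corpus: a real study needs thousands of labeled documents.
    texts = ["sample text one", "sample text two",
             "another sample", "yet more sample text"] * 10
    labels = ["male", "female", "male", "female"] * 10

    model = make_pipeline(
        # Character n-grams within word boundaries capture the small
        # stylistic preferences described above.
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, labels)
    print(model.predict(["an anonymous blog post to profile"]))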

New Scientist recently reported how researchers at Stevens Institute of Technology applied text mining to the “Gay Girl in Damascus” blog, showing that beneath the surface rhetoric of a lesbian woman lurked textual features more typical of a male’s writing (as indeed it was). Experiments using our Identitext system confirmed this result, and also showed that Bill Graber’s posts on LezGetReal appear more male than female. (If only these tests had been done earlier!)

Of course, outing hoaxer lesbian bloggers as straight men is not exactly a growth industry (let’s hope not!). But there are plenty of mainstream applications for this new technology, as it matures. Automatically analyzing the vast contents of online conversations is becoming more and more central to market research and business intelligence, and so knowing the demographics of the writers is becoming quite important. Such demographic profiling also offers the promise of more focused generation of leads for online businesses, or for understanding better what sorts of people read and comment on your blog or content website. And this is not to mention the potential applications in counter-terrorism and criminal investigations; these new computerized techniques will soon be standard tools in the arsenal of the intelligence analyst and the forensic linguist.

Posted in Commentary, Demographic profiling, Science, Text analytics | Leave a comment

Market Research is a Science!

Ray Poynter just published an excellent and very thought-provoking piece on his blog about the relationship between market research and science. Two of the key points he makes are (a) that science focuses on the general (or generalizable) while market research studies the particular, and (b) that science strives for objectivity, while market research inevitably involves many subjective choices (sample, timing, method, etc.). I’m far from an expert on market research, but I’d like to contribute my two cents about the nature of science and scientific methodology, and how it may relate to market research.

The question “What is science?” is a vexing one, and the grade-school answer invoking a Scientific Method of hypotheses, predictions, and controlled experiments does not really hold up under investigations of what scientists actually do. The issue is much more complicated, and full of dispute even among professional philosophers of science. An excellent exposition on the topic by the great 20th-century physicist Richard Feynman is well worth reading – a key point he makes is that science is about “doubt[ing] that what is being passed from the past is in fact true, and try[ing] to find out ab initio again from experience what the situation is.” I would distill this even further, and say that the essence of science is the effort to disprove what you think you know.

This deceptively simple idea does not just apply to investigations of abstract, general laws of behavior, as are studied in physics. Biologists study both general patterns of how living things work and also the intricate and often obscure workings of particular organisms; the latter is no less science than the former. Even further afield, paleontologists, evolutionary biologists, archaeologists, and even astrophysicists attempt to work out what specifically happened in the past, regarding particular situations and events that cannot be reproduced. Such historical sciences seek to understand the particular by amassing many complex pieces of evidence that can support or refute different hypotheses about it. (For more on this, see Wikipedia on paleontology and the work of philosopher Carol Cleland.)

A key point is that when dealing with the particular, there is rarely a “smoking gun” piece of evidence to support or refute a hypothesis (as there may be in, say, physics), in part because controlled experiments are not possible. What is needed is a certain amount of professional judgment in terms of how to weight and combine the myriad pieces of evidence that may bear upon a particular question. This judgment is personal, to some extent, but is justifiable in terms of evidence, experience, and professional norms.

I would suggest, therefore, that much of market research might be conceptualized as science, but as a kind of “science of the particular,” similar to historical sciences, in which multiple working hypotheses about a specific question can be entertained, and professional judgment (personal and debatable, but not strictly subjective) used to design studies and evaluate complex evidence.

On this view, market research is not much like engineering or other scientific applications, except where standard methods can be brought to bear on standardizable questions. To the extent that this may be the case, the analogy to engineering works. But if, as I understand it, many MR studies examine questions that are unique in certain essentials, there is more scientific discovery than engineering discipline involved.

Some “alternative paradigms” (in Poynter’s terms) would question whether researchers can assume the existence of an objective reality to examine, and hence whether they can talk meaningfully about evidence for or against specific ideas. However, the fact that much of what we can study about people is socially constructed does not mean that it does not exist objectively; rather, it means that we are studying specific phenomena that can change over time, just as biological species may change over time.

The fundamental point, I think, in seeing MR as scientific, is the effort to be skeptical of one’s own ideas, to try to disprove them, so as to overcome confirmation bias. This is something that researchers often address explicitly with their clients, showing them evidence that their preconceptions were incorrect. To be scientific, researchers need only shine that same spotlight on themselves, examining their own evidence, methods, arguments, and conclusions with rigorous criticism. This should be part of the methodological toolkit of every researcher.

That said, I think that Poynter’s final conclusions are absolutely correct, and worth repeating:

  • Different researchers give different answers (because of design in quant and the paradigm in qual).
  • If you are not an expert, judge market research in terms of past performance and recommendations.

Posted in Commentary, Market research, Science | 3 Comments

NGMR Meme Contest

Market research innovator and text analytics application pioneer Tom Anderson has been running an NGMR Meme Contest, and I’ve finally sent in a submission.

The other submissions are also very well worth looking at. One that I was dubious about at first, because of the ubiquity of Downfall parodies on YouTube, but that I ended up enjoying, was:

And don’t miss the following:

Head in the sand

Coolest researcher

The whole set can be seen on flickr.

Posted in NGMR | 1 Comment

Put down the duckie if you want to play the saxophone!

This old song from Sesame Street has been going through my head the last couple of days:

What duckie do you need to put down in order to play the saxophone?

Don’t forget though, as Ernie astutely notes in the epilogue, sometimes you need to put down the saxophone so you can squeak your duckie…

Posted in Uncategorized | 2 Comments

The NGMR Top-5-Hot vs. Top-5-Not: Computational intelligence and contextualized data

A major game-changer in market research (as for much else) over the last few years has been the explosion of powerful computational analytics and the enormous expansion of available data, fueled by the internet. A great and diverse horde of new ideas, techniques, and systems has been deployed: text analytics, sentiment analysis, social network analysis, web analytics, data visualization, and on and on.

So what has staying power and what is destined to fade away?

To make things a little more interesting (at least to me), let’s ignore, for the most part, existing ideas and methods, and look at some interesting things emerging from the ether, not necessarily directly related to market research but with definite relevance. And as I’m a confirmed interdisciplinarian, I’ll emphasize ideas coming from the collision of disparate disciplines, as these are among the most interesting and likely most impactful. Will they really be hot? Well, who the heck knows? But it would be very surprising indeed if something very like each of these were not in a top-ten list of key NGMR developments in the next few years.

The big theme: Bridging the gap between sweeping qualitative analyses and highly granular and quantitative analyses by using new techniques from computational intelligence on big data to contextualize patterns and identify niches and segments.


The Top-5-Hot:

Long tail lemur

  1. Using big data and analytics to find more specific niche markets in the long tail of the distribution of consumer preferences. Most preferences in many markets are niche preferences, so analyses that find only overall preferences and trends will inevitably leave a large part of the market on the table. This is where segmentation must come in, but it is only recently that tools for demographic and sociographic analysis of online data have become available (since we only observe online, and can’t directly ask questions). These tools, when applied to the immense numbers of virtual respondents that can be examined in automated online studies, will enable more detailed understanding and exploitation of the myriad niche segments hiding in the long tail.

Red and Blue

  2. Using text mining as an experimental methodology, not just for data exploration and classification as is currently done. This focuses attention on the important role of interpretation and modeling at the textual level when using automatic text processing to gauge thought and opinion. Given the complex mathematical and linguistic models used in text analytics, it is hard to know exactly what results mean without further investigation. This will amount, in some sense, to a fusion of qualitative and quantitative approaches to understanding online data; in other words, it will bring the a priori models and intuitions of qualitative analysis into automated and scalable quantitative techniques, for far greater insight and understanding. This sort of bidirectional method (explore/classify, model/test) has been applied to interesting work in the digital humanities, both to help find novel and interesting scholarly hypotheses in masses of data and to test them experimentally. These methods will be applied and expanded to market research, enabling more accurate and deeper understandings to emerge from online data.

  3. Deeper and more meaningful visualization techniques for seeing real patterns in enormous data (text, social networks, and more). Word clouds are already clichéd and never really gave much more than a kind of insightiness. Developing greater understanding and sophistication about visualization methods will enable us to move beyond bar charts and word clouds to more informative visualizations that represent real insights.

Sierpinski triangle

  4. The best market research has always relied on using multiple methodologies in conjunction to substantiate results. As the variety and size of data sources keep increasing, a sort of massive triangulation is becoming crucial, and current less-formal “meta-research methods” will develop into more rigorous methodologies that integrate both qualitative and quantitative analysis of diverse kinds of online and offline data.

2010 Northwest Pinball and Game Room Show

  5. One of the most exciting ideas I’ve seen is that of Research Through Gaming, where people play fun interactive games devised by researchers to provide useful information based on how they play. This notion bridges the gap between high-contact interactive methods (such as focus groups and in-person surveys) and high-volume non-interactive methods (such as online sentiment analysis), and also promises new and more precise kinds of understanding, since subjects can be observed minutely as they act in highly controllable environments. One trick will be figuring out how best to devise such games to yield useful information efficiently. The other will be ensuring that the games are fun, so that people play them and share them (hopefully going viral). If these problems can be solved, market research games will utterly transform the field over the next decade, if not much sooner.


The Top-5-Not:

  1. Online research using text, web, and social analytics that gives only overall trends or comparisons, without detailed segmentation or identification of niche segments.
  2. Use of computational analytic tools and listening/monitoring systems as black boxes that provide useful information. Researchers will become as sophisticated in understanding text and network analytics as they are in understanding statistics.
  3. Research where one methodology or data type predominates, such as work that eschews sentiment analysis, doesn’t account for social network effects, or ignores the possibility of in-depth qualitative analysis. Research will become highly multimethodological.
  4. Examining data only at a limited range of scales, whether small (as in focus groups), medium (as in telephone or online surveys), or huge (as in social media analytics). Multi-scale triangulation is the key, and larger-scale analyses, even if shallower, are needed to contextualize deeper small-scale results.
  5. Market research online communities (MROCs). I’ll go out on a limb and assert that while they will still exist, they will not be very important in a few years, as methods for gathering high-volume, high-quality data naturally from the internet improve. MROCs are expensive to create and maintain, and user interaction in less artificial settings (such as blogs, Twitter, and interactive games) will largely take their place as we get better at extracting and interpreting the information they contain.

So, what do you think?


Posted in NGMR, Prognostication, Text analytics | 3 Comments

“Who are they?” Not just, “What do they want?”

Franklin and Eleanor (photo © 2007 Tony, via Wylio)

Ever since George Gallup correctly predicted that Franklin Roosevelt would win the 1936 Presidential election, market research has been a critical element in politics as well as in business. The statistical basis for extrapolating from sampled data hasn’t changed much, but the methods for collecting the data have changed quite a bit.

The first market researchers stopped people in shopping centers and other public places and asked them to answer questionnaires. Direct mail followed, and then phone surveys – alas, those are still with us. The internet has now made it easier and cheaper to conduct surveys, though reliable sampling can be tricky. Focus groups are used for more in-depth qualitative research, though they tend to be very expensive.

Over the last couple of years, though, a revolution has been brewing in market research, as methods for automatically mining online text for facts and opinions have been developed which allow automated lexical analysis of blogs, news stories and even Twitter messages. These can determine people’s attitudes towards whatever is being researched without needing to ask any questions. Large numbers of people can thus be ‘polled’ at no inconvenience to them, and with less intrusion of researchers’ biases.

However, these methods, while powerful, are still limited in important ways. All market surveys are composed of two parts: questions probing the area being researched, and questions seeking general demographic data on the respondents. This is because knowing what people think overall is much less valuable than being able to break it down by demographics.

For example, let’s say a candidate for political office would like to know which issues are most important to his electorate. A survey of politically oriented blogs and comments may show that health care and the economy each account for 30% of public discourse overall. This might imply that the candidate should spend 30% of each appearance on health care and 30% on the economy.

With demographic data, however, it may well turn out that the candidate would be better off spending 70% of his time on health care when speaking in retirement homes and 70% of his time on the economy when speaking at sports-related events.

Park benches of the South Beach area of Miami Beach, favorite meeting places for the area’s large retirement community (photo © 1975 The U.S. National Archives, via Wylio)

Fans (photo © 2007 Steven Wilke, via Wylio)

As we see, adding back in the missing demographic data gives meaning to the lexically analyzed data and turns it from being merely generally informative to being a powerful tool for getting the right message to the right audience.
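
A toy calculation makes the point; all the numbers here are invented. The overall shares look balanced, while each segment is lopsided:

    from collections import Counter

    # Invented mention counts: (segment, topic) -> number of mentions
    data = {
        ("retirees", "health care"): 70, ("retirees", "economy"): 30,
        ("sports fans", "health care"): 30, ("sports fans", "economy"): 70,
    }

    def shares(counts):
        total = sum(counts.values())
        return {topic: "%.0f%%" % (100.0 * n / total) for topic, n in counts.items()}

    overall = Counter()
    for (_, topic), n in data.items():
        overall[topic] += n
    print("overall:", shares(overall))  # 50/50 -- looks balanced

    for seg in ("retirees", "sports fans"):
        seg_counts = {t: n for (s, t), n in data.items() if s == seg}
        print(seg + ":", shares(seg_counts))  # 70/30, each the other way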

Subtext3’s Identitext demographic profiling web-service determines age and gender information for online texts on the basis of textual analysis. When combined with semantic and sentiment analysis, this metadata enables you to form a complete picture of your digital data for market and political insight, one currently not available anywhere else.

Identitext is currently being made available in a restricted beta evaluation release; to request an evaluation, please contact us.

Posted in Demographic profiling, Identitext, Text analytics | Leave a comment

Identitext demographic identification system available in beta

Subtext3 announces availability of its Identitext demographic identification tool for beta testing. Identitext is a Web-service tool that can be used to identify the age and gender of the author of a free text document. Users send the text document or documents, each of which may be between 150 and 100,000 characters, to Identitext via the system’s URI and receive back the age and gender of the writer, along with a measurement of statistical certainty for this information.
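
To give a feel for the workflow, here is a hypothetical client sketch. The endpoint URL, parameter names, and response fields below are invented for illustration and are not the documented Identitext interface; beta users receive the actual details with their account.

    import requests

    IDENTITEXT_URL = "https://example.com/identitext/analyze"  # placeholder URI

    def profile_author(text, api_key):
        # Documents must be between 150 and 100,000 characters.
        assert 150 <= len(text) <= 100_000, "document length out of range"
        resp = requests.post(IDENTITEXT_URL, json={"text": text, "key": api_key})
        resp.raise_for_status()
        # Hypothetical response shape: {"gender": ..., "age": ..., "confidence": ...}
        return resp.json()

    sample = "This is a placeholder document for the demo. " * 8  # ~360 characters
    print(profile_author(sample, api_key="YOUR-BETA-KEY"))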

Identitext adds value to marketers using social networks and blogs to gather information about attitudes towards products, services or people, by associating age and gender groups with these attitudes.

This is a restricted beta; please contact us to receive a free account and detailed instructions for getting started.

Posted in Announcement, Demographic profiling, Identitext, Text analytics | Leave a comment

Elementary (education), my dear Watson

As I write this, my 10-year-old is doing her 5th-grade computer homework – creating a PowerPoint presentation on the grasslands biome, complete with colorful backgrounds, funky animations, and ubiquitous sound effects.  This assignment is far more time-consuming than her “reading, writing, and ‘rithmetic” homework, but somehow I can’t quite appreciate the educational importance of teaching kids to use PowerPoint. (Call me a Luddite, but I think her time would be better spent writing an old-fashioned essay.)

Oddly, I think that when kids need (or want) to use PowerPoint (for pitching VCs on a new social-iterative-lean-multi-dimensional lemonade-stand business model?), they’ll be able to figure things out pretty well on their own.  Elementary education is for teaching the fundamentals.  Learning specific computer applications is not the fundamentals.

I imagine that her teacher (or the school administration) feels that students need to learn about computers, in today’s highly technological society.  But learning about computers by creating PowerPoint presentations is like learning about English grammar by reading (not analyzing) comic books.

But should kids learn anything about computers in school?  A good (and brilliant) friend of mine thinks that teaching kids computer science today is like teaching kids auto mechanics 100 years ago – trendy, but useless for the vast majority.

This view is absolutely correct, if by “computer science education” we mean what my daughter’s school is teaching, or even if it meant teaching kids basic Java (or Perl, or Ruby) programming.

But computation is so much more than that, and it is something that all kids (and adults) would benefit from learning (and not just in a vague “Latin improves the mind” sort of way). The key is what Jeannette Wing has called “Computational Thinking”.

The development of computer science over the last 60 (or perhaps 120) years has given rise to radically new ways of thinking about information and about process; teaching children these ways of thinking is more akin to the classical focus on logic and rhetoric than to the traditional focus on ancient languages.

A brief anecdote: Early in my graduate school career, I was interviewed as a candidate for a graduate fellowship – the interviewer was an engineer, not a computer scientist (the relevance will become clear shortly).  Among other things, he asked me to solve the following problem:

Consider a game where two players alternately flip a coin. The first one to get heads wins. What is the probability that the second player will win?

I quickly solved the problem (answer: 1/3), and the interviewer asked me to show him how I solved it, seeming quite surprised at how quickly I had done so.  He had expected me to formulate an infinite series and solve it, whereas I had solved it by the following recursive formulation:

T = 1/2 + (1/4)T     (the 1st player wins on his first flip with probability 1/2; otherwise, with probability 1/4, both players flip tails and he is back where he started)
(3/4)T = 1/2         (rearranging terms)
T = 2/3              (so the 1st player wins with probability 2/3, and the 2nd player with probability 1/3)

Much simpler than solving an infinite series, no?  (The old anecdote about von Neumann comes to mind.)  But this way of solving the problem greatly surprised a well-educated and experienced engineer.  (And, yes, I got the fellowship.)
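
For the skeptical (or the computationally inclined), here’s a quick check of the answer, by simulation and via the infinite series:

    import random

    def second_player_wins():
        player = 1
        while random.random() >= 0.5:  # tails: pass the coin to the other player
            player = 3 - player
        return player == 2             # heads ended the game; did player 2 flip it?

    trials = 200_000
    print(sum(second_player_wins() for _ in range(trials)) / trials)  # ~0.333

    # The infinite-series route: player 2 wins on flip 2, 4, 6, ...
    print(sum(0.5 ** k for k in range(2, 60, 2)))  # (1/4) / (1 - 1/4) = 1/3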

Computational thinking encompasses ways of thinking about structure and process that very often can greatly simplify thinking about very complex problems.  Just a few of the fundamental ideas are:

  • recursion – defining a process or structure in terms of smaller instances of itself (as in the fellowship problem above)
  • abstraction – hiding the details of how something works behind a clean interface
  • decomposition – breaking a complex problem into separable, manageable parts
  • algorithmic thinking – specifying a process precisely enough that it can be carried out mechanically

These are concepts that are not only changing the way economists, and biologists, and statisticians, and even humanities scholars approach their disciplines, but also have great applicability to everyday life.  As our society grows ever more complex and interconnected, and the policy implications of technological innovation grow ever more profound, it is becoming essential for all educated citizens to understand and know how to deal with this complexity.

In other words, to learn computational thinking.

Posted in Commentary | 3 Comments

Sweet Tea Vodka – A Marketing Dilemma

Sweet Tea Vodkas have recently made a splash. It’s an interesting drink, being both extremely sweet and high-octane.

Sweet tea vodka (photo © 2010 angelo, via Wylio)

There is plenty of demographic information available about tea, and much data as well regarding the demographics of vodka. However, whereas the predominant demographic for vodka is men and women from 21 to 34, the predominant tea drinkers are women between 30 and 50.

Who then is the proper market to aim Sweet Tea Vodka at? One could play it safe with women between 30 and 34 but that might be a little more limiting than the manufacturers want to hear.

The technology for following blogs, tweets and other social media to determine attitudes towards Sweet Tea Vodka exists and is used for marketing surveys by several major marketing companies. While this sentiment analysis is helpful for determining the overall acceptance of a new concept drink, it does little to narrow down the demographic niche that the manufacturers should be targeting.

Enter Subtext3, a startup that can attach a demographic description to a free-text write-up based on subject matter, use of language, and word frequency. Based on mathematical models developed over years of research, its system can read a text and tell you the age and gender of the writer, along with the statistical degree of certainty.

Used in conjunction with the sentiment analysis techniques described above, Subtext3’s technology enables the marketing executive to point his clients towards the demographic niche they should be focusing on, the advertising media they should be using and the growth they should be expecting.
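
Mechanically, the combination is little more than cross-tabulating per-post sentiment against per-post demographic predictions, as in this toy sketch (all data invented):

    from collections import defaultdict

    posts = [
        # (predicted gender, predicted age band, sentiment) for each post
        ("female", "30-50", "positive"), ("female", "30-50", "positive"),
        ("male", "21-34", "negative"), ("female", "21-34", "positive"),
        ("male", "30-50", "negative"), ("female", "30-50", "positive"),
    ]

    tally = defaultdict(lambda: [0, 0])  # segment -> [positive, total]
    for gender, age, sentiment in posts:
        seg = tally[(gender, age)]
        seg[0] += sentiment == "positive"
        seg[1] += 1

    for (gender, age), (pos, total) in sorted(tally.items()):
        print("%s %s: %d/%d positive" % (gender, age, pos, total))
    # The segment combining volume with enthusiasm is the one to target.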

Then he can sit back and enjoy a long cool drink.

Posted in Demographic profiling, Social media, Text analytics | Tagged , | Leave a comment