Wednesday, November 12, 2008

Psephologists and data fitting

Psephologist: That's a real word. It means someone who studies elections and polls. It was the word I was reminded of when I listened to Weekday on kuow.org this morning as I drove to work. Today's show had two guests, both professors at the University of Washington if I recall correctly, talking about the recent US presidential election.

I first came across the word psephologist back in India in the late eighties, while watching election coverage on TV. Pundits were called in to give expert opinion on election polls and trends, and I learnt that they were called psephologists. Doordarshan had the excellent Vinod Dua and Prannoy Roy covering elections, and later NDTV (if I recall correctly, the first private news channel in India, launched by Prannoy Roy) did a great job covering elections.

The guests on Weekday today had interesting takes on how the US presidential campaigns affected the polls. One argued that actions by one camp were mostly canceled out by actions from the other, that Sen. Obama maintained a lead over Sen. McCain throughout the campaign, and that the vote could have been held at any time from July onward with the same result. The other guest argued that this election was different, and that trends among young voters and African Americans, along with larger voter registration among Democrats, were crucial to Sen. Obama's victory. At some level both were in agreement, but the opinion that the campaigning was somewhat irrelevant, as posited by one of the guests, was curious.

My own takeaway was that the same data could be interpreted in many different ways - so despite both professors having access to exactly the same data (or the same ocean of data), the interpretations varied depending on their existing ideas, biases and hypotheses.

I imagine that the election poll data is a data miner's dream - plenty of data points to use to segment the population, slice and dice the voting public into blocs and look at correlations and trends, and compare with previous elections. What fun!

However, post-facto analysis of data without knowledge of cause and effect can lend itself to a lot of problems in interpretation. Today's radio discussion reminded me of how TV and radio anchors often talk about a certain state or town that has reliably picked the presidential winner, or even how the winner of a football game predicts the winner of the election (I need to dig up the reference for this one). This kind of data fitting can be downright misleading: such correlations are often accidental and mostly just plain irrelevant.

For a cautionary tale of data mining giving silly results, read about how butter production in Bangladesh was a great predictor of S&P performance for two decades: David Leinweber went searching through random, unrelated data and found something that fit the bill. Clearly, however, there is no cause-and-effect relationship here, and so the correlation is meaningless.
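
The butter-in-Bangladesh trap is easy to reproduce with made-up numbers. Here is a minimal sketch in Python, not Leinweber's actual method or data: it builds one random "index" series, then searches a pile of equally random, unrelated series for the one that correlates best with it. The series length (20 points), the number of candidates (1000), the seed, and the function names are all my own assumptions for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def random_walk(rng, length):
    """Cumulative sum of random steps; stands in for any drifting yearly figure."""
    level, series = 0.0, []
    for _ in range(length):
        level += rng.gauss(0, 1)
        series.append(level)
    return series


def best_spurious_match(target, candidates):
    """Return (index, r) of the candidate most strongly correlated with target."""
    scored = [(i, pearson(target, c)) for i, c in enumerate(candidates)]
    return max(scored, key=lambda pair: abs(pair[1]))


rng = random.Random(2008)  # fixed seed so the run is repeatable
years = 20                 # assumed length of each series
target = random_walk(rng, years)  # hypothetical "index" series, pure noise
candidates = [random_walk(rng, years) for _ in range(1000)]  # unrelated noise

index, r = best_spurious_match(target, candidates)
print(f"best of {len(candidates)} unrelated series: #{index}, r = {r:.2f}")
```

None of the candidate series has anything to do with the target, yet the search always hands back a "best predictor". The more series you try, the better that accidental fit tends to look, which is exactly why a strong correlation found by searching means little on its own.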
