Sunday, June 8, 2008

Simpson's Paradox

Here's a rather amusing bit of statistics. Assume there are two players, A and B. We have data on how many games each of them won versus how many they played over the past two years, and we are trying to figure out who is the better player. Here are the numbers:

So let's find out who's better. Here is the result across the two years.

Let's verify this by doing the same for each year. Here are the results:

Wait a minute! How is that different? If A is better than B in the aggregate, how can B be better when we look at the individual years? Well, that's what is called Simpson's paradox. It's more a statistical anomaly to be careful of than a true paradox. A relation can weaken or even reverse when another strongly influencing factor (here, the number of games played in each year) is ignored. It is simpler to visualize with the vector graph below, from which it becomes apparent that the reversal is due to the large difference in the magnitudes of the vectors. In the graph, red represents Player A and blue represents Player B.
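To see the reversal with concrete arithmetic, here is a minimal Python sketch. The win/played counts in it are made up purely for illustration (they are not the numbers from the tables in this post), and the names `records`, `win_rate` and `aggregate` are likewise just illustrative. It computes each player's win rate per year and then over both years combined.

```python
from fractions import Fraction

# Hypothetical (won, played) records, chosen only to reproduce the effect.
# These are NOT the numbers from the original post.
records = {
    "A": {"year 1": (80, 100), "year 2": (2, 10)},
    "B": {"year 1": (9, 10), "year 2": (30, 100)},
}


def win_rate(won, played):
    """Exact win rate as a fraction, avoiding floating-point surprises."""
    return Fraction(won, played)


def aggregate(player):
    """Add up wins and games across all years (component-wise vector sum)."""
    won = sum(w for w, _ in records[player].values())
    played = sum(p for _, p in records[player].values())
    return won, played


for year in ("year 1", "year 2"):
    a = win_rate(*records["A"][year])
    b = win_rate(*records["B"][year])
    print(f"{year}: A = {float(a):.1%}, B = {float(b):.1%}")

a_total = win_rate(*aggregate("A"))
b_total = win_rate(*aggregate("B"))
print(f"overall: A = {float(a_total):.1%}, B = {float(b_total):.1%}")
```

With these made-up counts, B wins each year (90% vs 80%, then 30% vs 20%), yet A wins overall (about 74.5% vs 35.5%). In vector terms, each record is a vector (played, won) whose slope is the win rate. A's total is dominated by its long, steep year 1 vector, while B's total is dominated by its long, shallow year 2 vector.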


Sambit Kumar Dash said...

Hi Tanmay,

This is a very interesting observation. I would suggest that comparing the individual years is probably not meaningful, since the sample sizes are not equivalent, as you rightly pointed out. The combined results may give a better picture than the individual years. Hence, in most cases like these, comparisons are normally restricted to players who have played 100 or more matches in a year, so that the statistics are comparable. Moreover, people are interested in player-of-the-year recognition and not in career-best statistics. That's where the minimum number of games played in a year becomes an important metric.


Tanmay said...

Hi Sambit,
Yes, sample sizes are the key. It was quite intriguing when I first encountered this, though. :)