Wednesday, August 7, 2013

Are your data normal? Hint: no.

One of the frequently-asked questions over at the statistics subreddit (reddit.com/r/statistics) is how to test whether a dataset is drawn from a particular distribution, most often the normal distribution.

There are standard tests for this sort of thing, many with double-barreled names like Anderson-Darling, Kolmogorov-Smirnov, Shapiro-Wilk, Ryan-Joiner, etc.

But these tests are almost never what you really want.  When people ask these questions, what they really want to know (most of the time) is whether a particular distribution is a good model for a dataset.  And that's not a statistical test; it is a modeling decision.

All statistical analysis is based on models, and all models are based on simplifications.  Models are only useful if they are simpler than the real world, which means you have to decide which aspects of the real world to include in the model, and which things you can leave out.

For example, the normal distribution is a good model for many physical quantities.  The distribution of human height is approximately normal (see this previous blog post).  But human heights are not normally distributed.  For one thing, human heights are bounded within a narrow range, and the normal distribution goes to infinity in both directions.  But even ignoring the non-physical tails (which have very low probability anyway), the distribution of human heights deviates in systematic ways from a normal distribution.

So if you collect a sample of human heights and ask whether they come from a normal distribution, the answer is no.  And if you apply a statistical test, you will eventually (given enough data) reject the hypothesis that the data came from the distribution.

Instead of testing whether a dataset is drawn from a distribution, let's ask what I think is the right question: how can you tell whether a distribution is a good model for a dataset?

The best approach is to create a visualization that compares the data and the model, and there are several ways to do that.

Comparing to a known distribution

If you want to compare data to a known distribution, the simplest visualization is to plot the CDFs of the model and the data on the same axes.  Here's an example from an earlier blog post:


I used this figure to confirm that the data generated by my simulation matches a chi-squared distribution with 3 degrees of freedom.

Many people are more familiar with histograms than CDFs, so they sometimes try to compare histograms or PMFs.  This is a bad idea: the comparison depends on how you bin the data, and random variation from bin to bin tends to swamp the differences you care about.  Use CDFs.
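
If you want to reproduce this kind of comparison without the thinkstats classes, here's a minimal sketch using numpy, scipy, and matplotlib; the chi-squared sample is generated on the fly, just as a stand-in for real simulation output.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# stand-in for simulation output that should follow a chi-squared distribution
sample = np.random.chisquare(df=3, size=1000)

# empirical CDF: sorted values paired with cumulative fractions
xs = np.sort(sample)
ps = np.arange(1, len(xs) + 1) / len(xs)

plt.plot(xs, stats.chi2.cdf(xs, 3), label='chi-squared(3) model')
plt.plot(xs, ps, label='data')
plt.xlabel('value')
plt.ylabel('CDF')
plt.legend()
plt.show()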

Comparing to an estimated distribution

If you know what family of distributions you want to use as a model, but don't know the parameters, you can use the data to estimate the parameters, and then compare the estimated distribution to the data. Here's an example from Think Stats that compares the distribution of birth weight (for live births in the U.S.) to a normal model.

This figure shows that the normal model is a good match for the data, but there are more light babies than we would expect from a normal distribution.  This model is probably good enough for many purposes, but not for research on premature babies, which account for the deviation from the normal model.  In that case, a better model might be a mixture of two normal distributions.
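
Here's a rough sketch of that workflow; the array of weights is a placeholder (I'm not loading the actual survey data here), but the steps are the same: estimate the parameters from the data, then plot the fitted CDF and the empirical CDF together.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

weights = np.random.normal(7.3, 1.4, size=9000)  # placeholder for the real birth weights (lbs)

mu, sigma = weights.mean(), weights.std()        # estimate the parameters from the data

xs = np.sort(weights)
ps = np.arange(1, len(xs) + 1) / len(xs)         # empirical CDF

plt.plot(xs, stats.norm.cdf(xs, mu, sigma), label='normal model')
plt.plot(xs, ps, label='data')
plt.xlabel('birth weight (lbs)')
plt.ylabel('CDF')
plt.legend()
plt.show()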

Depending on the data, you might want to transform the axes.  For example, the following plot compares the distribution of adult weight for a sample of Americans to a lognormal model:


I plotted the x-axis on a log scale because under a log transform the CDF of a lognormal distribution is a nice symmetric sigmoid, so both tails are visible and easy to compare.  On a linear scale the left tail is too compressed to see clearly.
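
Here's a sketch of the same idea in code, again with made-up weights rather than the real sample: fit the lognormal by taking logs, then compare the CDFs with the x-axis on a log scale.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

weights = np.random.lognormal(mean=5.1, sigma=0.25, size=5000)  # placeholder adult weights (lbs)

logs = np.log(weights)
mu, sigma = logs.mean(), logs.std()      # lognormal parameters are the mean and std of log(x)

xs = np.sort(weights)
ps = np.arange(1, len(xs) + 1) / len(xs)

plt.plot(xs, stats.norm.cdf(np.log(xs), mu, sigma), label='lognormal model')
plt.plot(xs, ps, label='data')
plt.xscale('log')                        # a lognormal CDF is a symmetric sigmoid on this scale
plt.xlabel('adult weight (lbs)')
plt.ylabel('CDF')
plt.legend()
plt.show()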

UPDATE: Someone asked what that graph would look like on a linear scale:


So that gives a sense of what it looks like when you fit a normal model to lognormal data.

A common alternative to plotting CDFs is a Q-Q plot, which you can generate like this:

import matplotlib.pyplot as pyplot

def QQPlot(cdf, fit):
    """Makes a Q-Q plot of the values from the actual and fitted distributions.

    cdf: actual Cdf
    fit: model Cdf
    """
    ps = cdf.ps                            # percentile ranks from the actual CDF
    actual = cdf.xs                        # corresponding values from the data
    fitted = [fit.Value(p) for p in ps]    # model values at the same percentile ranks

    pyplot.plot(fitted, actual)

cdf and fit are Cdf objects as defined in thinkbayes.py.

The ps are the percentile ranks from the actual CDF.  actual contains the corresponding values from the data.  fitted contains values generated by looping through the ps and, for each percentile rank, looking up the corresponding value in the fitted CDF.

The resulting plot shows fitted values on the x-axis and actual values on the y-axis.  If the model matches the data, the plot should follow the identity line (y = x).
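
If you're not using the Cdf class, here's a rough equivalent using numpy percentiles and a fitted scipy distribution; the data here are synthetic, just to keep the sketch self-contained.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

data = np.random.normal(50, 10, size=500)      # placeholder data
mu, sigma = data.mean(), data.std()            # fitted normal parameters

ps = np.linspace(1, 99, 99)                    # percentile ranks
actual = np.percentile(data, ps)               # values from the data
fitted = stats.norm.ppf(ps / 100, mu, sigma)   # values from the fitted model

plt.plot(fitted, fitted, label='identity line (y = x)')
plt.plot(fitted, actual, label='Q-Q plot')
plt.xlabel('fitted values')
plt.ylabel('actual values')
plt.legend()
plt.show()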

Here's an example from an earlier blog post:


The blue line is the identity line.  Clearly the model is a good match for the data, except for the last point.  In this example, I reversed the axes and put the actual values on the x-axis.  It seemed like a good idea at the time, but probably was not.


Comparing to a family of distributions

When you compare data to an estimated model, the quality of fit depends on the quality of your estimate.  But there are ways to avoid this problem.

For some common families of distributions, there are simple mathematical operations that transform the CDF to a straight line.   For example, if you plot the complementary CDF of an exponential distribution on a log-y scale, the result is a straight line (see this section of Think Stats).

Here's an example that shows the distribution of time between deliveries (births) at a hospital in Australia:


The result is approximately a straight line, so the exponential model is probably a reasonable choice.
There are similar transformations for the Weibull and Pareto distributions.
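
Here's a minimal sketch of that check for the exponential case, with made-up interarrival times standing in for the Australian data:

import numpy as np
import matplotlib.pyplot as plt

times = np.random.exponential(scale=33.0, size=44)   # placeholder interarrival times (minutes)

xs = np.sort(times)
ccdf = 1.0 - np.arange(len(xs)) / len(xs)            # complementary CDF, from 1 down to 1/n

plt.plot(xs, ccdf, drawstyle='steps-post')
plt.yscale('log')             # an exponential CCDF is a straight line on a log-y scale
plt.xlabel('time between births (minutes)')
plt.ylabel('CCDF')
plt.show()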

Comparing to a normal distribution

Sadly, things are not so easy for the normal distribution.  But there is a good alternative: the normal probability plot.

There are two ways to generate normal probability plots.  One is based on rankits, but there is a simpler method:
  1. From a normal distribution with µ = 0 and σ = 1, generate a sample with the same size as your dataset and sort it.
  2. Sort the values in the dataset.
  3. Plot the sorted values from your dataset versus the random values.
Here's an example using the birthweight data from a previous figure:

A normal probability plot is basically a Q-Q plot, so you read it the same way.  This figure shows that the data deviate from the model in the range from 1.5 to 2.5 standard deviations below the mean.  In this range, the actual birth weights are lower than expected according to the model (I should have plotted the model as a straight line; since I didn't, you have to imagine it).
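
Here's a rough sketch of that recipe, with placeholder data standing in for the birth weights, and with the model plotted as a straight line this time:

import numpy as np
import matplotlib.pyplot as plt

weights = np.random.normal(7.3, 1.4, size=9000)   # placeholder for the birth weights (lbs)

model = np.sort(np.random.normal(0, 1, size=len(weights)))   # step 1: sorted standard normal sample
actual = np.sort(weights)                                    # step 2: sorted data

mu, sigma = weights.mean(), weights.std()
plt.plot(model, mu + sigma * model, label='model')           # fitted straight line
plt.plot(model, actual, label='data')                        # step 3
plt.xlabel('standard normal values')
plt.ylabel('birth weight (lbs)')
plt.legend()
plt.show()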

There's another example of a probability plot (including a fitted line) in this previous post.

The normal probability plot works because the family of normal distributions is closed under linear transformation, so it also works with other stable distributions (see http://en.wikipedia.org/wiki/Stable_distribution).


In summary: choosing an analytic model is not a statistical question; it is a modeling decision.  No statistical test can tell you whether a particular distribution is a good model for your data.  In general, modeling decisions are hard, but I think the visualizations in this article are some of the best tools to guide those decisions.


UPDATE:

I got the following question on reddit:
I don't understand. Why not go to the next step and perform a Kolmogorov-Smirnov test, etc? Just looking at plots is good but performing the actual test and looking at plots is better.
And here's my reply:
The statistical test does not provide any additional information. Real world data never match an analytic distribution perfectly, so if you perform a test, there are only two possible outcomes: 
1) You have a lot of data, so the p-value is low, so you (correctly) conclude that the data did not really come from the analytic distribution. But this doesn't tell you how big the discrepancy is, or whether the analytic model would still be good enough. 
2) You don't have enough data, so the p-value is high, and you conclude that there is not enough evidence to reject the null hypothesis. But again, this doesn't tell you whether the model is good enough. It only tells you that you don't have enough data. 
Neither outcome helps with what I think is the real problem: deciding whether a particular model is good enough for the intended purpose.
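
To make that concrete, here's a small sketch using synthetic data with a mild, fixed deviation from normality; you should see the p-value from a Kolmogorov-Smirnov test shrink as the sample size grows, even though the model is equally good (or bad) in every case.

import numpy as np
from scipy import stats

for n in [100, 1000, 10000, 100000]:
    # mostly normal data with a small systematic deviation (5% from a second component)
    sample = np.concatenate([
        np.random.normal(178, 7, size=int(0.95 * n)),
        np.random.normal(165, 5, size=n - int(0.95 * n)),
    ])
    mu, sigma = sample.mean(), sample.std()
    stat, p = stats.kstest(sample, 'norm', args=(mu, sigma))
    print(f"n={n}: KS p-value={p:.3g}")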

You can read the rest of the conversation here.