American Society For Nutrition

Multiple comparisons: Avoiding error (but which error?)

Excellence in Nutrition Research and Practice
Posted on 12/28/2009 at 06:08:44 PM by Student Blogger
By: Matt T.

As you can see by the title, I'm returning to my nerdy stats-lovin' roots this month.

The idea behind multiple comparisons is straightforward. Follow this link and quickly click the “Go!” button a dozen times or so to get a feel for it. What you're seeing are t-statistics for 20 different independent-sample t-tests. Significant results are highlighted. Here's the trick: these tests compare completely random, normally distributed groups of numbers. Every time you see one of those little boxes light up like a Christmas light, that's a type I error – a “significant difference” even though the numbers are completely random.
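You can reproduce that demo in a few lines. The sketch below (assuming groups of 10 observations each, which the applet doesn't specify) runs 20 two-sample t-tests on pure noise per "click," flags any |t| beyond the conventional 5% critical value, and repeats the click many times to see how often at least one test comes up falsely significant:

```python
import random
import statistics

CRIT = 2.101  # two-sided 5% critical t value for 18 degrees of freedom

def t_statistic(a, b):
    """Two-sample t-statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a) +
              (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = (pooled * (1 / na + 1 / nb)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

random.seed(1)
n_clicks = 1000
clicks_with_false_positive = 0
for _ in range(n_clicks):
    significant = 0
    for _ in range(20):  # 20 independent t-tests per "click"
        a = [random.gauss(0, 1) for _ in range(10)]  # pure noise
        b = [random.gauss(0, 1) for _ in range(10)]  # pure noise
        if abs(t_statistic(a, b)) > CRIT:
            significant += 1  # a type I error: the groups are identical
    if significant:
        clicks_with_false_positive += 1

rate = clicks_with_false_positive / n_clicks
print(rate)  # roughly 0.64: most "clicks" light up at least one box
```

Each individual test is behaving exactly as advertised – a 5% false-positive rate – yet across 20 of them a false alarm on any given click is the rule, not the exception.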

This is one of the dirty little secrets of science: whenever a null hypothesis is actually true, there's a 5% chance we'll declare a “significant result” anyway – and we never know which of our significant results are these false alarms.

The problem becomes more serious when we test differences in multiple outcomes in one study (and we usually do). As above, if we test 20 different variables, we're likely to see at least one false positive.
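The arithmetic behind "likely": if each of 20 independent tests of true null hypotheses has a 5% false-positive rate, the chance that at least one fires is

```python
m, alpha = 20, 0.05
fwer = 1 - (1 - alpha) ** m  # family-wise error rate: P(at least one false positive)
print(round(fwer, 3))  # 0.642 — about a 64% chance of at least one false positive
```

So a study reporting 20 uncorrected tests is more likely than not to contain at least one spurious "significant" finding.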

To mitigate this, we use things like Bonferroni or Tukey adjustments, which are simply ways to penalize our tests for the fact that we've made multiple comparisons. With a correction applied, we would see at least one false positive only about 1 in 20 times we click that button.
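The Bonferroni version of that penalty is the simplest: divide the significance threshold by the number of tests. A quick check (again assuming independent tests) confirms this pulls the family-wise error rate back to roughly 5%:

```python
m, alpha = 20, 0.05
alpha_bonf = alpha / m  # Bonferroni-adjusted per-test threshold: 0.0025
fwer_corrected = 1 - (1 - alpha_bonf) ** m
print(round(fwer_corrected, 3))  # 0.049 — back to about 1 click in 20
```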

As clear-cut as all this seems so far, multiple comparisons can get cloudy in practice. Some say we should adjust for every test we perform in a particular study. I argue that this stringent use goes too far.

Suppose, for example, we measure the effect of two diets on bone health. Bone health is a broad and abstract term, with lots of ways to measure it. In my own lab, we measure bone density by dual x-ray absorptiometry (DXA), bone turnover by plasma markers (e.g., BAP, or bone-specific alkaline phosphatase, an enzyme in plasma proportional to osteoblast activity) and bone structure by MRI. Each of these subcategories of bone health may have 3 or more measures. Because they all measure different facets of the same underlying phenomenon or “construct,” we would expect these measures to be strongly correlated.


A strict application of the concept of multiple comparisons would mandate that I penalize each of these measures for the total number of tests performed. With 8 measures, to declare a diet a success I would need a p-value < 0.05/8 = 0.00625. (Wow!)
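That threshold is just the Bonferroni division applied to 8 tests:

```python
alpha, tests = 0.05, 8
print(alpha / tests)  # 0.00625 — the p-value each measure must beat on its own
```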

But take a step back and ask what the extra information from multiple measurements is really telling us. If 8 different ways of measuring bone health each show the same trend, shouldn't this strengthen our case instead of weakening it? If these variables are not independent, why should we require that each statistical test independently satisfy such a stringent requirement? Wouldn't it be better to test whether all these interrelated variables are affected by the diet together?

If we were going to conclude that bone health differed by diet based on the result of only one out of the 8 variables, then adjustment is necessary to keep the false positive rate down to 5%. This is what multiple comparison adjustments were designed to do. However, when several alternate measures of the same construct are moving in the same direction, it makes more sense to hold all these variables to the same standard jointly.

For ways to do this, we can take a tip from the social scientists. They are used to thinking about data as aggregate measures of an underlying, abstract construct, rather than a collection of unique measures. For example, we can perform a multivariate test (available in most stats programs) to test the hypothesis that all the related variables are jointly different between diets.
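In Python, for instance, statsmodels provides MANOVA for exactly this. As a dependency-free illustration of the same idea, the sketch below runs a permutation test on one joint statistic – the average |t| across all 8 measures – so the variables are held to a single standard together rather than 8 separate ones. All the numbers here (8 measures, 15 subjects per diet, a uniform 1-SD diet effect) are made up for the example:

```python
import random
import statistics

def t_stat(a, b):
    """Two-sample t-statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a) +
              (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / (pooled * (1/na + 1/nb)) ** 0.5

def joint_stat(A, B):
    """Average absolute t-statistic across all measures: one joint yardstick."""
    p = len(A[0])
    return sum(abs(t_stat([s[j] for s in A], [s[j] for s in B]))
               for j in range(p)) / p

random.seed(2)
p, n = 8, 15  # 8 bone-health measures, 15 subjects per diet (hypothetical)
diet_a = [tuple(random.gauss(0.0, 1) for _ in range(p)) for _ in range(n)]
diet_b = [tuple(random.gauss(1.0, 1) for _ in range(p)) for _ in range(n)]  # shifted

observed = joint_stat(diet_a, diet_b)

# Permutation null: shuffle the diet labels and recompute the joint statistic
pooled_subjects = diet_a + diet_b
n_perm, at_least_as_big = 2000, 0
for _ in range(n_perm):
    random.shuffle(pooled_subjects)
    if joint_stat(pooled_subjects[:n], pooled_subjects[n:]) >= observed:
        at_least_as_big += 1

p_value = (at_least_as_big + 1) / (n_perm + 1)
print(p_value)  # one small p-value for the whole construct, not 8 separate hurdles
```

This is not MANOVA proper – a real analysis would model the covariance among measures – but it captures the logic: one hypothesis about the construct, one test, one p-value.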


Performing this test on my 8 interrelated bone health measures casts light on the big picture: collectively, bone health improves significantly, because all 8 aspects of it are improving (if not quite significantly on their own).

Multiple comparison adjustments are meant to keep us from changing our theories of how the world works based on one significant test among many. They are not meant to penalize us for collecting more data in the same experiment. More data should mean more perspective, and a clearer picture. Choosing multivariate statistical approaches to make use of this extra information, instead of punishing us for it, is just good common sense.

1 Comment
Posted Jan 31, 2010 5:53 AM by Vivian Detopoulou

I liked this post very much, especially the link for type-I error! This post emphasizes the usefulness of indices or component analysis!