Letter from the SCU Archives

Dear SCU,

I have three mice in each of two treatment groups, and my supervisor tells me to compare the outcomes using a Mann-Whitney U test. The two groups of mice had very different outcomes, but the p-value from the Mann-Whitney U test is only 0.10. Why isn’t it significant?

Signed: Bad news from the Mouse House

Dear Bad News,

The Mann-Whitney U test, or indeed any nonparametric test, depends only on the order of the data, not on the actual data values. The set of possible outcomes (orderings) is the set of all ways to order the outcomes from the 2 groups. One such outcome is (1,1,1,2,2,2) and another is (1,2,1,2,1,2). In total there are 20 possible outcomes, since an ordering is determined by which 3 of the 6 positions belong to group 1, and there are 6-choose-3 = 20 ways to choose them. The p-value is the probability that the outcome (or something more extreme) occurs when only chance is at play. What do we mean by “only chance”? In this case, it means that each ordering is equally likely. With 20 possible outcomes, each outcome has a probability of 1/20 = 0.05.
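To make the sample space concrete, here is a minimal sketch in Python (not part of the original letter; it uses only the standard library) that enumerates all orderings for two groups of 3:

```python
# Enumerate the sample space of the Mann-Whitney U test with 3 mice
# per group: every way to place three "1"s among six positions.
from itertools import combinations
from math import comb

n_total, n_group1 = 6, 3
orderings = []
for pos in combinations(range(n_total), n_group1):  # positions of group 1
    orderings.append(tuple(1 if i in pos else 2 for i in range(n_total)))

print(len(orderings), comb(n_total, n_group1))  # 20 20
print(1 / len(orderings))                       # 0.05: probability of each ordering
```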

What do we mean by an “extreme” outcome? When the two groups behave differently, we expect the groups to cluster (e.g. (1,1,1,2,2,2)). On the other hand, when the two groups have similar outcomes, we expect them to be intermingled (e.g. (1,2,1,2,1,2)). The “extremes” are the outcomes that best reflect a difference between the groups, namely (1,1,1,2,2,2) and (2,2,2,1,1,1). The p-value corresponding to these most extreme outcomes is 2/20 = 0.10.

Figure 1. The treatment group has consistently larger values than the control, so the order configuration is (C,C,C,T,T,T). Using the Mann-Whitney test, the associated p-value is 0.10. Clearly, more observations are needed for this outcome to be considered “statistically significant”.
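You can check the 0.10 directly with scipy.stats.mannwhitneyu; the values below are hypothetical, chosen only so that the treatment group is completely separated from the control group:

```python
# Exact two-sided Mann-Whitney U test on a completely separated outcome,
# i.e. the ordering (C,C,C,T,T,T). The values themselves are hypothetical.
from scipy.stats import mannwhitneyu

control = [1.2, 1.5, 1.8]
treatment = [2.4, 2.9, 3.1]

res = mannwhitneyu(control, treatment, alternative="two-sided", method="exact")
print(res.pvalue)  # 0.1, the smallest p-value achievable with 3 per group
```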

When you plan to use nonparametric tests to analyse your data, you should consider how many observations you need. Three per group will never be enough, since the smallest possible p-value is 0.10. With 4 mice in each group, there are 8-choose-4 = 70 possible orderings, and the p-value corresponding to the most extreme outcomes is 2/70, or about 0.029. But with even a little overlap between the two groups (e.g. (1,1,1,2,1,2,2,2)), the p-value will exceed 0.05.
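The same arithmetic extends to any group size: with n mice per group there are (2n)-choose-n orderings, so the smallest achievable two-sided p-value is 2 divided by that count. A short sketch tabulating this:

```python
# Smallest achievable two-sided p-value for a Mann-Whitney U test
# with n mice per group: 2 divided by the number of orderings, C(2n, n).
from math import comb

for n in range(2, 7):
    n_orderings = comb(2 * n, n)
    print(f"n={n}: {n_orderings:4d} orderings, min p = {2 / n_orderings:.4f}")
# n=3 gives 20 orderings (min p = 0.10); n=4 gives 70 (min p ~ 0.029)
```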

But let’s get beyond p-values. Suppose you run your experiment with 3 mice per treatment and then repeat the same experiment, giving you 6 mice per treatment in total. Suppose the second experiment gives a similar outcome. Then surely the two experiments together provide more evidence of an effect than either single experiment on its own, and a p-value that measures the evidence from both experiments should be smaller than the p-value from either one alone.

But it would be naïve to simply ignore the fact that the outcomes come from separate experiments. There may be an overall difference between the experiments because of the different conditions under which they were run, for example: different reagents, different litters, or different cage conditions. Imagine that you get the best possible outcome each time, (1,1,1,2,2,2), but when you naïvely pool the experiments and rank all 12 values together, the ordering becomes (1,1,1,2,2,1,2,1,1,2,2,2). This no longer looks like an “extreme” outcome from which you can infer that group 1 has worse outcomes than group 2. To get a meaningful p-value, you’ll need to go beyond simple nonparametric tests and take the data structure into account. I suggest you talk to a consultant at the SCU about your current experiment and your plans for future replications of this experiment.
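One illustration of taking the structure into account, offered as a sketch rather than as the SCU’s prescribed analysis: test within each experiment and then combine the within-experiment p-values, for example with Fisher’s method via scipy.stats.combine_pvalues. The data below are hypothetical, mimicking Figure 2 (both experiments completely separated, the second running at a lower overall level):

```python
# Respect the experiment structure: one exact Mann-Whitney test per
# experiment, then combine the two p-values with Fisher's method,
# instead of pooling ranks across experiments.
from scipy.stats import mannwhitneyu, combine_pvalues

# Hypothetical data: both experiments show complete separation, but
# experiment 2 runs at a lower overall level than experiment 1.
exp1_control, exp1_treated = [1.2, 1.5, 1.8], [2.4, 2.9, 3.1]
exp2_control, exp2_treated = [0.2, 0.4, 0.7], [1.0, 1.3, 1.6]

p1 = mannwhitneyu(exp1_control, exp1_treated,
                  alternative="two-sided", method="exact").pvalue
p2 = mannwhitneyu(exp2_control, exp2_treated,
                  alternative="two-sided", method="exact").pvalue

stat, p_combined = combine_pvalues([p1, p2], method="fisher")
print(p1, p2, p_combined)  # 0.1, 0.1, ~0.056: more evidence than either alone
```

A stratified rank test (for example van Elteren’s test) would be another option; the essential point is the same: the treatment-versus-control comparison is made within experiments, never across them.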

Figure 2. Two consistent experiments, both of which indicate that the treatment group has larger values than the control group. However, the treated group in the experiment on the right tends to have lower values than the control group in the experiment on the left. The relevant contrasts are the contrasts within experiments.

By the way, how did you arrange the mice in the cages? I hope you didn’t have only 2 cages, one cage per treatment! Please consider that there may be cage effects, which would then be completely confounded with the treatment effect. In particular, how would you know whether an observed group difference is due to the cage or to the treatment?

With best wishes,

The Statistical Consulting Unit (SCU)
