Statistics: not lying is harder than you think

In order to make sense of the data I collect, I use statistics. The statistical tools available for data analysis these days are pretty incredible, leaps and bounds ahead of the simple, classical statistics like chi-square, which worked great – if you had perfect data.

Field biologists like me don’t have perfect data. We have really, really terrible data, from a statistical perspective. We have unbalanced sample sizes, measuring 15 birds here, 21 here, 9 there; we have data with weird things in common, like measurements from different groups of nestlings, some of which are siblings; and we always have tons of noise in our data – because it was weirdly rainy that year, and also hot, and also the oak trees put out more acorns than usual, and that one chick was from a runt egg, and…

Excuse me, I generate only AWESOME data.

So we need statistics that work with messy data. “Generalized” is generally a promising word for field biologists in statistics: it means that your data don’t have to be a perfect bell curve for the test to work. Generalized linear mixed effects models. Generalized additive mixed models. These are my friends – especially because several smart, kind souls have made it possible to do these tests in the program R without needing a master’s degree in statistics or computer science. Yay.

But. The more I work with statistics, and the more I learn about specific models, the more it becomes clear to me that the challenge now in statistics is not how to model your data in a way that gives you some sort of result; it’s how to choose a model that does not lie to you. Models don’t warn you when you use them incorrectly. They don’t say “Hey clown, this is count data and you just wrote ‘family = gaussian,’ wake up!” No, they say “Highly significant, wow, publish this in Nature immediately!” (Unfortunately, modeling errors often give you highly significant results. Nowadays when I see significant results, my first thought is “I bet I did something wrong…”)
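To make the ‘family = gaussian’ mistake concrete, here is a minimal sketch in plain Python (not R, and not the analysis from this post). The counts are invented for illustration, and the two fitting functions are hypothetical helpers written for this example: one fits an ordinary straight line, which is roughly what a Gaussian family with its default identity link does, and the other fits a Poisson model with a log link by Newton's method.

```python
import math

# Hypothetical data: a count recorded at ten values of some predictor.
x = list(range(10))
counts = [0, 0, 1, 0, 1, 2, 3, 6, 9, 15]


def fit_gaussian_identity(x, y):
    """Ordinary least-squares line (Gaussian errors, identity link)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def fit_poisson_log(x, y, max_iter=50, tol=1e-10):
    """Poisson regression with a log link, fitted by Newton's method."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Entries of the 2 x 2 Fisher information matrix.
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: solve (information) * step = score.
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


g0, g1 = fit_gaussian_identity(x, counts)
p0, p1 = fit_poisson_log(x, counts)

gaussian_fit = [g0 + g1 * xi for xi in x]
poisson_fit = [math.exp(p0 + p1 * xi) for xi in x]

for xi, yi, gf, pf in zip(x, counts, gaussian_fit, poisson_fit):
    print(f"x={xi}  observed={yi:2d}  gaussian={gf:6.2f}  poisson={pf:6.2f}")
```

With these made-up numbers the straight-line fit predicts a negative count at the low end of x, which is impossible for count data, while the log-link fit stays positive everywhere. Neither function complains about anything, which is the point: the wrong model runs just as happily as the right one.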

I bet you did something wrong too.

There are a lot of things to potentially get wrong, and it’s much, much easier to not worry about all of them; to just figure the residuals look fine, to not investigate whether you’ve used the right link function. To not even realize that you have used a link function, because it’s a default setting that doesn’t show up unless you specify it.
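For readers who have never knowingly met a link function, here is a small sketch of the idea in plain Python. The coefficients are hypothetical, chosen only for illustration. The model is a straight line on the "link" scale, and the inverse link converts that line back into a predicted mean. In R's glm, as far as I know, the defaults are identity for gaussian, log for poisson, and logit for binomial, which is why the link can be in play without ever being typed.

```python
import math

# Each entry: (link, inverse link). The model is linear in link(mean).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
}

# Hypothetical coefficients on the linear-predictor scale.
intercept, slope = -1.0, 0.5


def predicted_mean(link_name, x):
    """Map the linear predictor back to the scale of the data."""
    _, inverse = LINKS[link_name]
    return inverse(intercept + slope * x)


for name in LINKS:
    means = [predicted_mean(name, x) for x in (0, 1, 2)]
    print(name, [round(m, 3) for m in means])
```

The same two coefficients mean different things under each link. With the identity link, each step in x adds 0.5 to the predicted mean. With the log link, each step multiplies the mean by exp(0.5), about 1.65. With the logit link, each step shifts the log-odds, and the mean stays between 0 and 1. Reading a slope without knowing which link produced it is an easy way to misreport an effect.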

But you have to do all of this, meticulously, because journal reviewers probably won’t catch your mistakes. As the models get more complex, it becomes harder for outsiders to tell whether the researcher used the right test or not. If the researcher describes a plausible test – or even, depending on the statistical literacy of the reviewers, if the researcher describes a complicated, impressively math-y sounding test – then it will probably be accepted. Reviewers don’t have time to demand to see your dataset and reanalyze it themselves. Nor should they have to: that’s your job.

That is my nightmare: to unwittingly publish results from an incorrect analysis. Even if it was an accident, it would still be lying to my field: promoting an untruth in this work where truth is the whole point.

(I also have nightmares where I’m chased by decomposing clowns, like a normal person. That’s just my career nightmare.)

My nightmare is to be eaten by a bigger fish. But sure, worry about putting the wrong numbers on a piece of paper.

So I’ve been reading a lot of statistical resources lately, trying not to do this. (Note to the people who write these resources: if you could put in fewer derivations of equations, and more things like “What a link function is in less than 100 words,” I would really appreciate it.)

To see why I’m nervous, here’s a quick example. I’ve been working with the measurements of junco bills that I’ve taken, trying to look at how bill size and shape change – or don’t – with elevation and time. Here are three graphs I’ve made of the fluctuation of bill width over time. (Y-axis is a standardized measure of bill width; x-axis is year.) These are all from pretty much exactly the same model, with just minor differences – a random effect added here, removed there. They should all look pretty much the same, right?

This version makes it look like maybe there’s a slight increase with time.

This version is wiggly!

This version is super wiggly!

I actually really love that last graph – it’s so squiggly! – but I don’t think it’s true. I’m not confident that any of these three are true. But they serve to demonstrate how important it is to use the right model, since minor model changes affect your conclusions so dramatically.

Statistics: the science of striving not to accidentally lie about your data.

7 thoughts on “Statistics: not lying is harder than you think”

kestrelart on January 28, 2013 at 1:30 PM said:

I’d love to see the data that generated such curves.
This is such a great post both because of the intellectual challenge and your integrity.
The first time I gave a painting of mine as a gift (gannets thronging their Yorkshire breeding colony), it was to my longtime friend and scientific partner when she was awarded her professorial chair in biostatistics. We have learned our trades together over what is now nearly two decades. It’s been a blast. I have to model my experiments meaningfully and know the limits on interpretation. I don’t want to squeeze out a spurious p value and wave it like a trophy.
So have fun and seduce a statistician (intellectually I mean). Reality never looks quite the same again.

- toughlittlebirds on January 28, 2013 at 4:35 PM said:
  
  A statistically knowledgeable companion is so valuable! I used to have one in my office but he got another postdoc and moved to Canada. Sigh. I’m trying to become sufficiently knowledgeable myself to be able to help my labmates when they run into stats issues, but the subject is sort of fractal in complexity. But worth the difficulty, I think.
  
  - Patrick Kelley on February 25, 2013 at 11:32 AM said:
    
    Great post on nonlinear models, Katie. You raised by far the most important points about these devilish models! Another view is that these models (i.e. the additive models you used to model the above data) are in fact telling you the truth about the data, but they’re not telling you the truth about the phenomenon you’re trying to understand. Maybe we could look at statistics as having two main obstacles. The first obstacle would be to ensure that a model is appropriate for the data (i.e. understanding that averages usually stink for most data sets). The second is simply to make sure to be conservative when drawing inference from the result (i.e. decrease the squiggles) but also to reassess that inference when there are more data! Your post got me thinking again about some basic issues. Many thanks! As my data once told me…keep on collecting!
    
lylekrahn on January 28, 2013 at 4:59 PM said:

It is nice to see that you are trying so hard to keep it accurate. It’s easy for a lot of facts to go into an inaccurate conclusion.

- toughlittlebirds on January 29, 2013 at 12:40 PM said:
  
  That’s a great phrase! “Easy for a lot of facts to go into an inaccurate conclusion.” That’s exactly the tricky part – you think you must be all right because your data are true, but it’s so easy to come up with an untruth anyway.
  
vanbraman on January 28, 2013 at 9:37 PM said:

In the medical field we use a lot of statistics. It amazes me what our researchers can do with their data, and I am glad that I am not the one having to crunch the numbers or validate them.

- toughlittlebirds on January 29, 2013 at 12:42 PM said:
  
  I’m very glad I’m not working on medical issues; that raises the stakes so much. I like to imagine (hope) that medical people know how to use statistics perfectly and are never wrong…
  