To make sense of the data I collect, I use statistics. The statistical tools available for data analysis these days are pretty incredible, leaps and bounds ahead of simple, classical tests like the chi-square, which worked great – if you had perfect data.
Field biologists like me don’t have perfect data. We have really, really terrible data, from a statistical perspective. We have unbalanced sample sizes, measuring 15 birds here, 21 there, 9 somewhere else; we have data with weird things in common, like measurements from different groups of nestlings, some of which are siblings; and we always have tons of noise in our data – because it was weirdly rainy that year, and also hot, and also the oak trees put out more acorns than usual, and that one chick was from a runt egg, and…
So we need statistics that work with messy data. “Generalized” is generally a promising word for field biologists in statistics: it means that your data don’t have to be a perfect bell curve for the test to work. Generalized linear mixed effects models. Generalized additive mixed models. These are my friends – especially because several smart, kind souls have made it possible to run these models in the program R without needing a Master’s in statistics or computer science. Yay.
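The “mixed effects” part of those model names is what handles the siblings-in-the-same-nest problem. The core idea can be sketched in a few lines of Python (a toy illustration with made-up numbers and my own function name, not the API of R’s lme4 or any real package): each group’s estimate gets shrunk toward the overall mean, and small, noisy groups get shrunk the most.

```python
# Toy sketch of the "partial pooling" behind a random intercept.
# (Invented data; real packages estimate the variance components
# by maximum likelihood instead of taking them as given.)

def shrunk_group_means(groups, var_within, var_between):
    """Shrink each group's mean toward the grand mean.

    Small, noisy groups (like 3 birds at one site) are pulled
    strongly toward the overall mean; large groups barely move.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    result = []
    for g in groups:
        n = len(g)
        group_mean = sum(g) / n
        # weight grows with sample size: more data -> trust the group more
        w = n / (n + var_within / var_between)
        result.append(w * group_mean + (1 - w) * grand_mean)
    return result

# Two sites with unbalanced sample sizes (made-up bill widths, mm)
site_a = [7.1, 7.3, 7.2, 7.4, 7.0, 7.2, 7.3, 7.1,
          7.2, 7.3, 7.2, 7.1, 7.3, 7.2, 7.4]   # 15 birds
site_b = [7.6, 7.5, 7.7]                        # 3 birds

print(shrunk_group_means([site_a, site_b],
                         var_within=0.04, var_between=0.01))
```

The big site’s estimate stays almost exactly at its own mean, while the three-bird site gets pulled noticeably toward the grand mean – which is exactly the behavior you want when some “sites” are really just one family of siblings.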
But. The more I work with statistics, and the more I learn about specific models, the more it becomes clear to me that the challenge now in statistics is not how to model your data in a way that gives you some sort of result; it’s how to choose a model that does not lie to you. Models don’t warn you when you use them incorrectly. They don’t say “Hey clown, this is count data and you just wrote ‘family = gaussian,’ wake up!” No, they say “Highly significant, wow, publish this in Nature immediately!” (Unfortunately, often modeling errors give you highly significant results. Nowadays when I see significant results, my first thought is “I bet I did something wrong…”)
There are a lot of things to potentially get wrong, and it’s much, much easier to not worry about all of them; to just figure the residuals look fine, to not investigate whether you’ve used the right link function. To not even realize that you have used a link function, because it’s a default setting that doesn’t show up unless you specify it.
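For what it’s worth, the link-function idea fits in a few lines of Python (a toy sketch, not any package’s actual API): the model is linear on the link scale, and the *inverse* link maps those linear predictions back onto a scale that makes sense for your data. The defaults below match the usual pairings (Gaussian/identity, Poisson/log, binomial/logit).

```python
import math

# A link function connects the linear predictor (which can be any real
# number) to the mean of the response (which often can't be).

def inverse_identity(eta):  # Gaussian default: mean = eta
    return eta

def inverse_log(eta):       # Poisson default: mean = exp(eta), always > 0
    return math.exp(eta)

def inverse_logit(eta):     # binomial default: mean stays in (0, 1)
    return 1 / (1 + math.exp(-eta))

# The same linear predictor on three different scales:
eta = -1.5
print(inverse_identity(eta))  # -1.5  -- an impossible mean for count data
print(inverse_log(eta))       # ~0.22 -- a legal mean count
print(inverse_logit(eta))     # ~0.18 -- a legal probability
```

This is why `family = gaussian` on count data is so insidious: the identity link will happily predict a mean of −1.5 birds, and nothing warns you.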
But you have to do all of this, meticulously, because journal reviewers probably won’t catch your mistakes. As the models get more complex, it becomes harder for outsiders to tell whether the researcher used the right test or not. If the researcher describes a plausible test – or even, depending on the statistical literacy of the reviewers, if the researcher describes a complicated, impressively math-y sounding test – then it will probably be accepted. Reviewers don’t have time to demand to see your dataset and reanalyze it themselves. Nor should they have to: that’s your job.
That is my nightmare: to unwittingly publish results from an incorrect analysis. Even if it were an accident, it would still be lying to my field: promoting an untruth in this work where truth is the whole point.
(I also have nightmares where I’m chased by decomposing clowns, like a normal person. That’s just my career nightmare.)
So I’ve been reading a lot of statistical resources lately, trying not to do this. (Note to the people who write these resources: if you could put in fewer derivations of equations, and more things like “What a link function is in less than 100 words,” I would really appreciate it.)
To see why I’m nervous, here’s a quick example. I’ve been working with the measurements of junco bills that I’ve taken, trying to look at how bill size and shape change – or don’t – with elevation and time. Here are three graphs I’ve made of the fluctuation of bill width over time. (The y-axis is a standardized measure of bill width; the x-axis is year.) These are all from pretty much exactly the same model, with just minor differences – a random effect added here, removed there. They should all look pretty much the same, right?
I actually really love that last graph – it’s so squiggly! – but I don’t think it’s true. I’m not confident that any of these three are true. But they serve to demonstrate how important it is to use the right model, since minor model changes affect your conclusions so dramatically.
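How can a random effect added or removed change the picture that much? A tiny Python sketch (invented numbers, not my junco data) shows the cleanest version of the problem: the very same points can give opposite slopes depending on whether the model knows that some observations belong together, like siblings sharing a nest.

```python
# Toy illustration: ignoring grouping can flip the sign of a trend.
# (Made-up "nests"; the numbers are chosen to make the effect obvious.)

def slope(xs, ys):
    """Ordinary least-squares slope: cov(x, y) / var(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Two nests; within each, bill width falls slightly with hatch order,
# but the second nest is both later-hatching and bigger-billed.
nest1_x, nest1_y = [1, 2, 3], [5.0, 4.8, 4.6]
nest2_x, nest2_y = [4, 5, 6], [6.0, 5.8, 5.6]

pooled = slope(nest1_x + nest2_x, nest1_y + nest2_y)
within = (slope(nest1_x, nest1_y) + slope(nest2_x, nest2_y)) / 2

print(pooled)  # positive: lump everything together, bills "increase"
print(within)  # negative: within each nest, the trend runs the other way
```

A random effect for nest is what lets the model separate those two stories – which is why adding or dropping one can turn a rising curve into a falling squiggle.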
Statistics: the science of striving not to accidentally lie about your data.