It’s every scientist’s worst nightmare: six papers retracted in a single day, complete with a press release that’s helping the world’s science reporters disseminate and discuss the news.
That’s exactly what happened Wednesday at the journal network JAMA, and to the Cornell researcher Brian Wansink.
Wansink has been the director of Cornell’s Food and Brand Lab. For years, he has been known as a “world-renowned eating behavior expert.”
On Thursday, Cornell announced that a faculty committee found Wansink “committed academic misconduct,” and that he would retire from the university on June 30, 2019. In the meantime, Wansink “has been removed from all teaching and research,” Cornell University provost Michael Kotlikoff said in a statement. Wansink will spend his remaining time at the university cooperating in an “ongoing review of his prior research.”
Even if you’ve never heard of Wansink, you’re probably familiar with his ideas. His studies, cited more than 20,000 times, are about how our environment shapes how we think about food, and what we end up consuming. He’s one of the reasons Big Food companies started offering smaller snack packaging, in 100 calorie portions. He once led the USDA committee on dietary guidelines and influenced public policy. He helped Google and the US Army implement programs to encourage healthy eating.
But over the past couple of years, the scientific house of cards that underpinned this work and influence has started crumbling. A cadre of skeptical researchers and journalists, including BuzzFeed’s Stephanie Lee, have taken a close look at Wansink’s food psychology research unit, the Food and Brand Lab at Cornell University, and have shown that unsavory data manipulation ran rampant there.
Thirteen of Wansink’s studies have now been retracted, including the six pulled from JAMA Wednesday. Among them: studies suggesting that people who grocery shop hungry buy more calories; that preordering lunch can help you choose healthier food; and that serving people out of large bowls encourages them to serve themselves larger portions.
In a press release, JAMA said Cornell couldn’t “provide assurances regarding the scientific validity of the 6 studies” because they didn’t have access to Wansink’s original data. So, Wansink’s ideas aren’t necessarily wrong, but he didn’t provide credible evidence for them.
According to the Cornell provost, Wansink’s academic misconduct included “the misreporting of research data, problematic statistical techniques, failure to properly document and preserve research results, and inappropriate authorship.”
But this story is a lot bigger than any single researcher. It’s important because it helps shine a light on persistent problems in science that have existed in labs across the world, problems that science reformers are increasingly calling for action on. Here’s what you need to know.
Wansink had a knack for producing studies that were catnip for the media, including us here at Vox. In 2009, Wansink and a co-author published a study that went viral, suggesting that the Joy of Cooking cookbook (and others like it) was contributing to America’s growing waistline. It found that recipes in more recent editions of the tome, which has sold more than 18 million copies since 1936, contain more calories and larger serving sizes than its earliest editions.
The study focused on 18 classic recipes that have appeared in Joy of Cooking since 1936 and found that their average calorie density had increased by 35 percent per serving over the years.
There was also Wansink’s famous “bottomless bowls” study, which concluded that people will mindlessly guzzle down soup as long as their bowls are automatically refilled, and his “bad popcorn” study, which demonstrated that we’ll gobble up stale and unpalatable food when it’s presented to us in huge quantities.
Together, they helped Wansink reinforce his larger research agenda focused on how the decisions we make about what we eat and how we live are very much shaped by environmental cues.
The critical inquiry into his work started in 2016 when Wansink published a blog post in which he inadvertently admitted to encouraging his graduate students to engage in questionable research practices. Since then, scientists have been combing through his body of work and looking for errors, inconsistencies, and general fishiness. And they’ve uncovered dozens of head-scratchers.
In more than one instance, Wansink misidentified the ages of participants in published studies, mixing up children ages 8 to 11 with toddlers. In sum, the collective efforts have led to a whole dossier of troublesome findings in Wansink’s work.
To date, 13 of his papers have been retracted. And that’s stunning given that Wansink was so highly cited and his body of work was so influential. Wansink also collected government grants, helped shape the marketing practices at food companies, and worked with the White House to influence food policy in this country.
Among the biggest problems in science that the Wansink debacle exemplifies is the “publish or perish” mentality.
To be more competitive for grants, scientists have to publish their research in respected scientific journals. For their work to be accepted by these journals, they need positive (i.e., statistically significant) results.
That puts pressure on labs like Wansink’s to do what’s known as p-hacking. The “p” stands for p-values, a measure of statistical significance. Typically, researchers hope their results yield a p-value of less than .05, the cutoff below which they can call their results significant.
P-values are a bit complicated to explain (we have tried in earlier explainers). But basically: They’re a tool to help researchers understand how rare their results would be if there were no real effect. If the results would be super rare under that assumption, scientists can feel more confident their hypothesis is correct.
Here’s the thing: P-values of .05 aren’t that hard to find if you sort the data differently or perform a huge number of analyses. In flipping coins, you’d think it would be rare to get 10 heads in a row. You might start to suspect the coin is weighted to favor heads and that the result is statistically significant.
But what if you just got 10 heads in a row by chance (it can happen) and then suddenly decided you were done flipping coins? If you kept going, you’d stop believing the coin is weighted.
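To put rough numbers on the coin example, here is a minimal Python sketch that computes an exact one-sided p-value for a fair coin. The function name `one_sided_p_value` and the follow-up tally of 55 heads in 100 flips are illustrative assumptions, not figures from any study discussed here.

```python
from math import comb


def one_sided_p_value(heads: int, flips: int) -> float:
    """Probability of seeing at least `heads` heads in `flips` tosses of a fair coin."""
    # Count every sequence with `heads` or more heads, out of 2**flips equally likely ones.
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favorable / 2 ** flips


if __name__ == "__main__":
    # Stop right after the lucky streak: 10 heads in 10 flips.
    print(one_sided_p_value(10, 10))  # 1/1024, roughly 0.001

    # Hypothetical continuation: 90 more flips give 45 heads, so 55 of 100 overall.
    print(one_sided_p_value(55, 100))
```

Stopping after the streak gives a p-value of about .001, far below the .05 cutoff. In the hypothetical longer run, the same streak is diluted and the p-value lands well above .05, so the coin no longer looks weighted.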
Stopping an experiment when a p-value of .05 is achieved is an example of p-hacking. But there are other ways to do it — like collecting data on a large number of outcomes but only reporting the outcomes that achieve statistical significance. By running many analyses, you’re bound to find something significant just by chance alone.
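Both tactics can be simulated with a fair coin, where any "significant" result is a false alarm by construction. The sketch below is an illustration under stated assumptions, not a reconstruction of anyone's actual analysis: it uses a normal-approximation test with a 1.96 cutoff, checks the result after every flip from 10 to 200, and treats the reported outcomes as statistically independent. All function names are hypothetical.

```python
import random

ALPHA = 0.05
Z_CRITICAL = 1.96  # two-sided normal cutoff that corresponds to p < .05


def is_significant(heads: int, flips: int) -> bool:
    """Normal-approximation test of whether a coin looks unfair."""
    z = abs(heads - flips / 2) / (flips / 4) ** 0.5
    return z > Z_CRITICAL


def rate_with_peeking(trials: int, min_flips: int, max_flips: int,
                      rng: random.Random) -> float:
    """Share of fair-coin experiments declared significant when the
    researcher checks after every flip and stops at the first 'hit'."""
    hits = 0
    for _ in range(trials):
        heads = 0
        for flips in range(1, max_flips + 1):
            if rng.random() < 0.5:
                heads += 1
            if flips >= min_flips and is_significant(heads, flips):
                hits += 1
                break
    return hits / trials


def rate_fixed_sample(trials: int, flips: int, rng: random.Random) -> float:
    """Share declared significant when the test is run once, at the end."""
    hits = 0
    for _ in range(trials):
        heads = sum(1 for _ in range(flips) if rng.random() < 0.5)
        if is_significant(heads, flips):
            hits += 1
    return hits / trials


def chance_of_any_false_positive(outcomes: int, alpha: float = ALPHA) -> float:
    """Chance that at least one of `outcomes` independent null tests dips below alpha."""
    return 1 - (1 - alpha) ** outcomes


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    print("test once at 200 flips:", rate_fixed_sample(2000, 200, rng))
    print("peek after every flip: ", rate_with_peeking(2000, 10, 200, rng))
    print("20 outcomes, any hit:  ", chance_of_any_false_positive(20))
```

Testing once at the end should flag a fair coin about 5 percent of the time, while peeking after every flip flags it far more often. With 20 independent outcomes, the chance that at least one clears the .05 bar by luck alone is about 64 percent.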
According to BuzzFeed’s Lee, who obtained Wansink’s emails, instead of testing a hypothesis and reporting on whatever findings he came to, Wansink often encouraged his underlings to crunch data in ways that would yield more interesting or desirable results.
In effect, he was running a p-hacking operation — or as one researcher, Stanford’s Kristin Sainani, told BuzzFeed, “p-hacking on steroids.”
Wansink’s sloppiness and exaggerations may be greater than ordinary. But many, many researchers have admitted to engaging in some form of p-hacking in their careers.
A 2012 survey of 2,000 psychologists found p-hacking tactics were commonplace. Fifty percent admitted to only reporting studies that panned out (ignoring data that was inconclusive). Around 20 percent admitted to stopping data collection after they got the result they were hoping for. Most of the respondents thought their actions were defensible. Many thought p-hacking was a way to find the real signal in all the noise.
But it isn’t. Increasingly, even textbook studies and phenomena are coming undone as researchers retest them with more rigorous designs.
There’s a movement of scientists who seek to rectify practices in science like the ones that Wansink is accused of. Together, they basically call for three main fixes that are gaining momentum.
- Preregistration of study designs: This is a huge safeguard against p-hacking. Preregistration means that scientists publicly commit to an experiment’s design before they start collecting data. This makes it much harder to cherry-pick results.
- Open data sharing: Increasingly, scientists are calling on their colleagues to make all the data from their experiments available for anyone to scrutinize (there are exceptions, of course, for particularly sensitive information). This ensures that shoddy research that makes it through peer review can still be double-checked.
- Registered replication reports: Scientists are hungry to see if previously reported findings in the academic literature hold up under more intense scrutiny. There are many efforts underway to replicate (exactly or conceptually) research findings with rigor.
There are other potential fixes too: There’s a group of scientists calling for a stricter definition of statistically significant. Others argue that arbitrary cutoffs for significance are always going to be gamed. And increasingly, scientists are turning to other forms of mathematical analysis, such as Bayesian statistics, which asks a slightly different question of data. (While p-values ask, “How rare are these numbers?” a Bayesian approach asks, “What’s the probability my hypothesis is the best explanation for the results we’ve found?”)
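As a rough illustration of the Bayesian question, the sketch below applies Bayes’ rule to a coin that is either fair or weighted. Everything specific in it is an assumption made for the example: only two hypotheses are considered, the weighted coin is taken to land heads 75 percent of the time, and both hypotheses start out equally likely.

```python
def posterior_weighted(heads: int, tails: int,
                       weighted_bias: float = 0.75,
                       prior_weighted: float = 0.5) -> float:
    """Posterior probability that the coin is weighted, given the flips seen.

    Assumes exactly two candidate explanations: a fair coin (heads 50 percent
    of the time) and a weighted coin (heads `weighted_bias` of the time).
    """
    # Likelihood of this exact sequence of flips under each hypothesis.
    like_fair = 0.5 ** (heads + tails)
    like_weighted = weighted_bias ** heads * (1 - weighted_bias) ** tails

    # Bayes' rule: prior times likelihood, divided by the total evidence.
    evidence = prior_weighted * like_weighted + (1 - prior_weighted) * like_fair
    return prior_weighted * like_weighted / evidence


if __name__ == "__main__":
    # Ten heads in a row strongly favors the weighted coin.
    print(posterior_weighted(10, 0))

    # A hypothetical longer run of 55 heads and 45 tails favors the fair coin.
    print(posterior_weighted(55, 45))
```

The output is a direct statement about the hypotheses themselves, and it moves as evidence accumulates. It also depends on the starting assumptions, which is one reason this approach is an alternative to p-values and not a cure-all.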
No one solution will be the panacea. And it’s important to recognize that science has to grapple with a much more fundamental problem: its culture.
In 2016, Vox sent out a survey to more than 200 scientists asking, “If you could change one thing about how science works today, what would it be and why?” One of the clear themes in the responses: The institutions of science need to get better at rewarding failure instead of prizing publication above all else.
One young scientist told us, “I feel torn between asking questions that I know will lead to statistical significance and asking questions that matter.”
Brian Wansink faced the same dilemma. And it’s increasingly clear which path he chose.