Writing about web page http://www.pnas.org/content/early/2016/06/27/1602413113.full
In Eklund et al., Cluster Failure: Why fMRI inferences for spatial extent have inflated false-positive rates, we report on a massive evaluation of statistical methods for task fMRI using resting state as a source of null data. We found very poor performance with one variant of cluster size inference, with a cluster defining threshold (CDT) of P=0.01 giving familywise error (FWE) rates of 50% or more when 5% was expected; CDT of P=0.001 was much better, sometimes within the confidence bounds of our evaluation but almost always above 5% (i.e. suggesting a bias). We subsequently delve into the reasons for these inaccuracies, finding heavy-tailed spatial correlations and spatially-varying smoothness to be the likely suspects. In contrast, non-parametric permutation performs ‘spot on’, only having some inaccuracies in one-sample t-tests, likely due to asymmetric distribution of the errors.
I’ve worked on comparisons of parametric and nonparametric inference methods for neuroimaging essentially my entire career (see refs in Eklund et al.). I’m especially pleased with this work for (1) the use of a massive library of resting-state fMRI for null realisations, finally having samples that reflect the complex spatial and temporal dependence structure of real data, and (2) getting to the bottom of why the parametric methods don’t work.
However, there is one number I regret: 40,000. In trying to refer to the importance of the fMRI discipline, we used an estimate of the entire fMRI literature as the number of studies impinged by our findings. In our defense, we found problems with cluster size inference in general (severe for P=0.01 CDT, biased for P=0.001), the dominant inference method, suggesting the majority of the literature was affected. The number in the impact statement, however, has been picked up by popular press and fed a small twitterstorm. Hence, I feel it’s my duty to make at least a rough estimate of “How many articles does our work affect?”. I’m not a bibliometrician, and this is really a rough-and-ready exercise, but it hopefully gives a sense of the order of magnitude of the problem.
The analysis code (in Matlab) is laid out below, but here is the skinny: Based on some reasonable probabilistic computations, but perhaps fragile samples of the literature, I estimate about 15,000 papers use cluster size inference with correction for multiple testing; of these, around 3,500 use a CDT of P=0.01. 3,500 is about 9% of the entire literature, or perhaps more usefully, 11% of papers containing original data. (Of course some of these 15,000 or 3,500 might use nonparametric inference, but it’s unfortunately rare for fMRI—in contrast, it’s the default inference tool for structural VBM/DTI analyses in FSL).
I frankly thought this number would be higher, but I hadn’t realised how large a proportion of studies never used any sort of multiple testing correction. (Can’t have inflated corrected significances if you don’t correct!). These calculations suggest 13,000 papers used no multiple testing correction. Of course some of these may be using regions of interest or sub-volume analyses, but it’s a scant few (i.e. clinical trial style outcome) that have absolutely no multiplicity at all. Our paper isn’t directly about this group, but for publications that used the folk multiple testing correction, P<0.001 & k>10, our paper shows this approach has familywise error rates well in excess of 50%.
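As a quick cross-check of the figures quoted above, here is a minimal Python sketch (a re-expression for illustration, not the Matlab analysis code itself). The five input values are copied from the Matlab listing later in this post; the constant and function names are my own. It should land close to the 15,000, 3,500 and 13,000 counts, and to the roughly 9% and 11% shares.

```python
# Inputs copied from the Matlab listing in this post.
N_STUDIES = 40000   # fMRI papers in PubMed (rounded)
P_DATA = 0.80       # share of papers containing original data
P_CORR = 0.59       # share correcting for multiple testing, among data papers
P_CLUS = 0.79       # share using cluster inference, among corrected papers
P_GTE01 = 0.24      # share with a CDT of P=0.01 or more liberal, among those


def headline_counts():
    """Return the paper counts discussed in the text as a dict of floats."""
    n_data = N_STUDIES * P_DATA
    n_cluster = n_data * P_CORR * P_CLUS
    return {
        "data": n_data,                        # papers with original data
        "cluster": n_cluster,                  # corrected cluster inference
        "cdt01": n_cluster * P_GTE01,          # ... with CDT P=0.01 or more liberal
        "uncorrected": n_data * (1 - P_CORR),  # data papers with no correction
    }


counts = headline_counts()
share_of_all = counts["cdt01"] / N_STUDIES
share_of_data = counts["cdt01"] / counts["data"]

for name, value in counts.items():
    print(f"{name:>12}: {value:8.0f}")
print(f"CDT P=0.01 papers as share of all papers:  {share_of_all:.1%}")
print(f"CDT P=0.01 papers as share of data papers: {share_of_data:.1%}")
```

The two shares differ only in the denominator: all 40,000 papers versus the 80% that contain original data.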
So, are we saying 3,500 papers are “wrong”? It depends. Our results suggest CDT P=0.01 results have inflated P-values, but each study must be examined… if the effects are really strong, it likely doesn’t matter if the P-values are biased, and the scientific inference will remain unchanged. But if the effects are really weak, then the results might indeed be consistent with noise. And, what about those 13,000 papers with no correction, especially common in the earlier literature? No, they shouldn’t be discarded out of hand either, but a particularly jaded eye is needed for those works, especially when comparing them to new references with improved methodological standards.
My take homes from this exercise have been:
- No matter what method you’re using, if you go to town on a P-value on the razor’s edge of P=0.05000, you lean heavily on the assumptions of your method, and any perturbation of the data (or slight failure of the assumptions) would likely give you a result on the other side of the boundary. (This is a truism for all of science, but particularly for neuroimaging where we invariably use hard thresholds.)
- Meta-analysis is an essential tool to collate a population of studies, and can be used in this very setting when individual results have questionable reliability. In an ideal world, all studies, good and bad, would be published with full data sharing (see next 2 points), each clearly marked with their varying strengths of evidence (no correction < FDR voxel-wise < FDR cluster-wise < FWE cluster-wise < FWE voxel-wise). This rich pool of data, with no file drawer effects, could then be distilled with meta-analysis to see what effects stand up.
- Complete reporting of results, i.e. filing of statistical maps in public repositories, must happen! If all published studies’ T maps were available, we could revisit each analysis (approximately at least). The discipline of neuroimaging is embarrassingly behind genetics and bioinformatics, where SNP-by-SNP or gene-by-gene results (effect size, T-value, P-values) are shared by default. This is not a “data sharing” issue, this is a transparency of results issue… that we’ve gone 25 years showing only bitmap JPEG/TIFF’s of rendered images or tables of peaks is shameful. I’m currently in discussions with journal editors to press for sharing of full, unthresholded statistic maps to become standard in neuroimaging.
- Data sharing must also come. With the actual full data, we could revisit each paper’s analysis exactly and, what’s more, in 5 years, revisit it again with even better methods, or for more insightful (e.g. predictive) analyses.
The PNAS article has now been corrected:
- Correction for Eklund et al., Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates, PNAS 2016 0:1612033113v1-201612033; doi:10.1073/pnas.1612033113.
- Blog post, Errata for Cluster failure.
- The online PDF of the original article now reflects these corrections.
Nstudies=40000;   % N studies in PubMed with "fMRI" in title/abstract
Pdata=0.80;       % Prop. of studies actually containing data
Pcorr=0.59;       % Prop. of studies correcting for multiple testing, among data studies
Pclus=0.79;       % Prop. of cluster inference studies, among corrected studies
Pgte01=0.24;      % Prop. of cluster-forming thresholds 0.01 or larger, among corrected cluster inference studies

% Number of studies using corrected cluster inference (higher P)
Nstudies*Pdata*Pcorr*Pclus
% 14,915

% Number of studies using corrected cluster inference with
% cluster defining threshold of 0.01 or lower (higher P)
Nstudies*Pdata*Pcorr*Pclus*Pgte01
% 3,579

% Number of studies with original data not using a correction for multiple testing
Nstudies*Pdata*(1-Pcorr)
% 13,120
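The listing multiplies point estimates, each taken from a finite sample of the literature. To get a rough feel for how much sampling error alone could move the 3,579 figure, the following Python sketch applies a first-order (delta-method) approximation to the product of proportions. It rests on assumptions that are mine, not the post's: each proportion is treated as coming from an independent simple random sample, the sample sizes are read off the notes that follow (300; 300-59=241; roughly 765, i.e. 814 less 6%; and 480), and the 40,000 total is treated as fixed. It says nothing about other sources of fragility, such as the different time windows of the source surveys.

```python
import math

N_STUDIES = 40000  # treated as fixed (ASSUMPTION)

# (proportion, sample size) pairs. Proportions are from the Matlab listing;
# sample sizes are read off the notes (ASSUMPTION: independent simple random samples).
INPUTS = {
    "Pdata": (0.80, 300),
    "Pcorr": (0.59, 241),
    "Pclus": (0.79, 765),
    "Pgte01": (0.24, 480),
}


def product_with_se(inputs, scale):
    """Return (estimate, standard error) for scale * product of proportions.

    Uses the first-order approximation that the squared relative error of a
    product of independent estimates is the sum of their squared relative errors.
    """
    estimate = scale
    rel_var = 0.0
    for p, n in inputs.values():
        estimate *= p
        rel_var += (1 - p) / (p * n)  # (se/p)^2 for a binomial proportion
    return estimate, estimate * math.sqrt(rel_var)


estimate, se = product_with_se(INPUTS, N_STUDIES)
print(f"Estimated papers with CDT P=0.01 or more liberal: {estimate:.0f}")
print(f"Approximate standard error from sampling alone:   {se:.0f}")
```

Under these assumptions, the smallest proportion (Pgte01) contributes the largest share of the relative uncertainty.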
1. Nstudies: 42,158 rounded down, from a PubMed search for “((fmri[title/abstract] OR functional MRI[title/abstract]) OR functional Magnetic Resonance Imaging[title/abstract])” conducted 5 July 2016.
2. Pdata: Carp, 2012, literature 2007 – 2012, same search as in note 1, with additional constraints, from which a random sample of 300 was selected. Of these 300, 59 were excluded as not presenting original fMRI data, giving (300-59)/300=0.80.
3. Pcorr: Carp, 2012, “Although a majority of studies (59%) reported the use of some variety of correction for multiple comparisons, a substantial minority did not.”
4. Pclus: Woo et al., 2014, papers Jan 2010 – Nov 2011, “fMRI” and “threshold”, 1500 papers screened, 814 included; of those 814, 607 used cluster thresholding, and 607/814=0.746, matching the 75% in Fig 1. However, Fig 1 also shows that 6% of those studies had no correction. To match up with Carp’s use of corrected statistics, we thus revise this to 607/(814-0.06*814)=0.79.
5. Pgte01: Woo et al., data from author (below), cross-classifying 480 studies with sufficient detail to determine the cluster defining threshold. There are 35 studies with a CDT P>0.01 and 80 using CDT P=0.01, giving 115/480=0.240.
CDT      AFNI    BV    FSL    SPM    OTHERS
_____    ____    __    ___    ___    ______
>.01        9     5      9      8         4
.01         9     4     44     20         3
.005       24     6      1     48         3
.001       13    20     11    206         5
<.001       2     5      3     16         2
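The arithmetic in the notes and the table above can be re-derived with a short Python sketch. The counts are copied from the notes and the table; the variable names are my own, and this is a check on the sums rather than part of the original analysis.

```python
# Cluster-defining-threshold counts by software package, copied from the table.
PACKAGES = ["AFNI", "BV", "FSL", "SPM", "OTHERS"]
CDT_TABLE = {
    ">.01":  [9, 5, 9, 8, 4],
    ".01":   [9, 4, 44, 20, 3],
    ".005":  [24, 6, 1, 48, 3],
    ".001":  [13, 20, 11, 206, 5],
    "<.001": [2, 5, 3, 16, 2],
}

# Row totals: number of studies at each CDT, summed over packages.
row_totals = {cdt: sum(row) for cdt, row in CDT_TABLE.items()}
n_total = sum(row_totals.values())

# Share of studies with a CDT of P=0.01 or more liberal.
p_gte01 = (row_totals[">.01"] + row_totals[".01"]) / n_total

# Share of sampled papers presenting original fMRI data.
p_data = (300 - 59) / 300

# Share of corrected studies using cluster inference: 607 cluster-thresholding
# studies out of 814, after removing the 6% of the 814 with no correction.
p_clus = 607 / (814 - 0.06 * 814)

print(f"Row totals: {row_totals} (total {n_total})")
print(f"Pdata  = {p_data:.3f}")
print(f"Pclus  = {p_clus:.3f}")
print(f"Pgte01 = {p_gte01:.3f}")
```

Rounded to two decimals, these match the 0.80, 0.79 and 0.24 used in the Matlab listing.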
Revised to add mention of the slice of the literature that actually does use nonparametric inference (thanks to Nathaniel Daw).
Initially failed to adjust for final comment in note #4; correction increased all the cluster numbers slightly (thanks to Alan Pickering).
Added a passage about possibility of clinical trial type studies, with truly no multiplicity.
Added take-home about meta-analysis.
Revised to reference PNAS correction. 16 August 2016.