Errata for Cluster failure

Follow-up to Bibliometrics of Cluster Inference from the NISOx blog.

Given the widespread misinterpretation of our paper, Eklund et al., Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates, we filed errata with the PNAS Editorial office:

Errata for Eklund et al., Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates.
Eklund, Anders; Nichols, Thomas E; Knutsson, Hans

Two sentences were poorly worded and could easily be misunderstood as overstating our results.

The last sentence of the Significance statement should read: “These results question the validity of a number of fMRI studies and may have a large impact on the interpretation of weakly significant neuroimaging results.”

The first sentence after the heading “The future of fMRI” should have read: “Due to lamentable archiving and data-sharing practices it is unlikely that problematic analyses can be redone.”

These replace the two sentences that mistakenly implied that our work affected all 40,000 publications (see Bibliometrics of Cluster Inference for a guesstimate of how much of the literature is potentially affected).

After initially declining the errata, on the grounds that it was correcting interpretation and not fact, PNAS have agreed to publish it:

  • Correction for Eklund et al., Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates, PNAS 2016 0:1612033113v1-201612033; doi:10.1073/pnas.1612033113.

The online PDF of the original article now reflects these corrections.

Update.

Revised to reflect PNAS acceptance of errata. 19 July 2016.
Revised to reference PNAS correction. 16 August 2016.

Bibliometrics of Cluster Inference

Writing about web page http://www.pnas.org/content/early/2016/06/27/1602413113.full

In Eklund et al., Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates, we report on a massive evaluation of statistical methods for task fMRI using resting state as a source of null data. We found very poor performance with one variant of cluster size inference, with a cluster defining threshold (CDT) of P=0.01 giving familywise error (FWE) rates of 50% or more when 5% was expected; a CDT of P=0.001 was much better, sometimes within the confidence bounds of our evaluation but almost always above 5% (i.e. suggesting a bias). We subsequently delve into the reasons for these inaccuracies, finding heavy-tailed spatial correlations and spatially-varying smoothness to be the likely suspects. In contrast, non-parametric permutation performs ‘spot on’, only having some inaccuracies in one-sample t-tests, likely due to asymmetric distribution of the errors.

I’ve worked on comparisons of parametric and nonparametric inference methods for neuroimaging essentially my entire career (see refs in Eklund et al.). I’m especially pleased with this work for (1) the use of a massive library of resting-state fMRI for null realisations, finally having samples that reflect the complex spatial and temporal dependence structure of real data, and (2) getting to the bottom of why the parametric methods don’t work.

However, there is one number I regret: 40,000. In trying to refer to the importance of the fMRI discipline, we used an estimate of the entire fMRI literature as the number of studies impinged by our findings. In our defense, we found problems with cluster size inference in general (severe for P=0.01 CDT, biased for P=0.001), the dominant inference method, suggesting the majority of the literature was affected. The number in the impact statement, however, has been picked up by the popular press and fed a small twitterstorm. Hence, I feel it’s my duty to make at least a rough estimate of “How many articles does our work affect?”. I’m not a bibliometrician, and this is really a rough-and-ready exercise, but it hopefully gives a sense of the order of magnitude of the problem.

The analysis code (in Matlab) is laid out below, but here is the skinny: Based on some reasonable probabilistic computations, but perhaps fragile samples of the literature, I estimate about 15,000 papers use cluster size inference with correction for multiple testing; of these, around 3,500 use a CDT of P=0.01 or larger. That 3,500 is about 9% of the entire literature, or perhaps more usefully, 11% of papers containing original data. (Of course some of these 15,000 or 3,500 might use nonparametric inference, but it’s unfortunately rare for fMRI; in contrast, it’s the default inference tool for structural VBM/DTI analyses in FSL.)

I frankly thought this number would be higher, but didn’t realise how large a proportion of studies never used any sort of multiple testing correction. (Can’t have inflated corrected significances if you don’t correct!) These calculations suggest 13,000 papers used no multiple testing correction. Of course some of these may be using regions of interest or sub-volume analyses, but it’s a scant few (i.e. clinical-trial-style analyses with a single outcome) that have absolutely no multiplicity at all. Our paper isn’t directly about this group, but for publications that used the folk multiple testing correction of P<0.001 and a 10-voxel cluster extent, our paper shows this approach has familywise error rates well in excess of 50%.

So, are we saying 3,500 papers are “wrong”? It depends. Our results suggest CDT P=0.01 results have inflated P-values, but each study must be examined… if the effects are really strong, it likely doesn’t matter if the P-values are biased, and the scientific inference will remain unchanged. But if the effects are really weak, then the results might indeed be consistent with noise. And, what about those 13,000 papers with no correction, especially common in the earlier literature? No, they shouldn’t be discarded out of hand either, but a particularly jaded eye is needed for those works, especially when comparing them to new references with improved methodological standards.

My take homes from this exercise have been:

  • No matter what method you’re using, if you go to town on a P-value on the razor’s edge of P=0.05000, you lean heavily on the assumptions of your method, and any perturbation of the data (or slight failure of the assumptions) would likely give you a result on the other side of the boundary. (This is a truism for all of science, but particularly for neuroimaging where we invariably use hard thresholds.)
  • Meta-analysis is an essential tool to collate a population of studies, and can be used in this very setting when individual results have questionable reliability. In an ideal world, all studies, good and bad, would be published with full data sharing (see next 2 points), each clearly marked with their varying strengths of evidence (no correction < FDR voxel-wise < FDR cluster-wise < FWE cluster-wise < FWE voxel-wise). This rich pool of data, with no file drawer effects, could then be distilled with meta-analysis to see what effects stand up.
  • Complete reporting of results, i.e. filing of statistical maps in public repositories, must happen! If all published studies’ T maps were available, we could revisit each analysis (approximately at least). The discipline of neuroimaging is embarrassingly behind genetics and bioinformatics, where SNP-by-SNP or gene-by-gene results (effect size, T-value, P-values) are shared by default. This is not a “data sharing” issue, this is a transparency-of-results issue… that we’ve gone 25 years showing only bitmap JPEGs/TIFFs of rendered images or tables of peaks is shameful. I’m currently in discussions with journal editors to press for sharing of full, unthresholded statistic maps to become standard in neuroimaging.
  • Data sharing must also come. With the actual full data, we could revisit each paper’s analysis, exactly, and, what’s more, in 5 years, revisit again with even better methods, or for more insightful (e.g. predictive) analyses.

Update

The PNAS article has now been corrected; see the correction reference in the errata entry above.

The calculations

Nstudies=40000;   % N studies in PubMed with "fMRI" in title/abstract [1]
Pdata=0.80;       % Prop. of studies actually containing data [2]
Pcorr=0.59;       % Prop. of studies correcting for multiple testing, among data studies [3]
Pclus=0.79;       % Prop. of cluster inference studies, among corrected studies [4]
Pgte01=0.24;      % Prop. of cluster-forming thresholds 0.01 or larger, among corrected cluster inference studies [5]

% Number of studies using corrected cluster inference
Nstudies*Pdata*Pcorr*Pclus        % 14,915

% Number of studies using corrected cluster inference with a cluster defining threshold of P=0.01 or larger (i.e. more lenient)
Nstudies*Pdata*Pcorr*Pclus*Pgte01 %  3,579

% Number of studies with original data not using a correction for multiple testing
Nstudies*Pdata*(1-Pcorr)          % 13,120 

Notes

  1. 42,158 rounded down, from a Pubmed search for “((fmri[title/abstract] OR functional MRI[title/abstract]) OR functional Magnetic Resonance Imaging[title/abstract])” conducted 5 July 2016.
  2. Carp, 2012, literature 2007 – 2012, same search as in [1], with additional constraints, from which a random sample of 300 was selected. Of these 300, 59 were excluded as not presenting original fMRI data; (300-59)/300=0.80.
  3. Carp, 2012, “Although a majority of studies (59%) reported the use of some variety of correction for multiple comparisons, a substantial minority did not.”
  4. Woo et al., 2014, papers Jan 2010 – Nov 2011, “fMRI” and “threshold”, 1500 papers screened, 814 included; of those 814, 607 used cluster thresholding, and 607/814=0.746, matching the 75% in Fig 1. However, Fig 1 also shows that 6% of those studies had no correction. To match up with Carp’s use of corrected statistics, we thus revise this to 607/(814-0.06*814)=0.79.
  5. Woo et al., data from author (below), cross-classifying 480 studies with sufficient detail to determine the cluster defining threshold. There are 35 studies with a CDT P>0.01, 80 using CDT P=0.01, giving 115/480=0.240.
    CDT      AFNI   BV   FSL   SPM   OTHERS
    >.01        9    5     9     8        4
    .01         9    4    44    20        3
    .005       24    6     1    48        3
    .001       13   20    11   206        5
    <.001       2    5     3    16        2

Revisions

Revised to add mention of the slice of the literature that actually does use nonparametric inference (thanks to Nathaniel Daw).

Initially failed to adjust for final comment in note #4; correction increased all the cluster numbers slightly (thanks to Alan Pickering).

Added a passage about possibility of clinical trial type studies, with truly no multiplicity.

Added take-home about meta-analysis.

Revised to reference PNAS correction. 16 August 2016.

Unix find tricks

I will complete a proper, reasoned, motivated blog entry on the find command one day, but for now this just records the essential commands for some (seemingly simple) tasks.


Find all files in the current directory but ignore any directories (and their children) called .git:

find . -name .git -prune -o -type f -print

And here’s how I originally used this, showing the word count for each such file:

find . -name .git -prune -o -type f -exec wc {} \;
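
A small variant, if your find supports the + terminator to -exec (GNU and BSD find both do): batch the file names so wc runs as few times as possible and prints a grand total at the end:

find . -name .git -prune -o -type f -exec wc {} +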


Count lines, but only for text files:

find . -type f -exec grep -Iq . {} \; -and -exec wc {} \;
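
And the same again, this time piped through xargs so that file names with spaces are handled safely (assuming your find and xargs support -print0/-0, as the GNU and BSD versions do):

find . -type f -exec grep -Iq . {} \; -print0 | xargs -0 wc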


Bulk histograms with MassHist.m

When doing a quality control check of image data, one useful thing to look at is the histogram of the image intensities. When presented with masses of images, it’s useful to look at one histogram for each image in a big panel.

This MassHist.m script (for SPM) will print to PDF a ‘gang’ of histograms for all the images provided. The layout has 2 columns and 5 rows on a portrait-oriented page, but these are easily changed in the code.

MassHist.m

Linux Access Control Lists

Call me a dinosaur, but I really liked the ability of the Andrew File System to give fine-tuned control of access to directories. It allowed you to grant other users administrator rights, as well as rights to only insert files, or only delete files, etc. However, I digress.

The purpose of this post is to provide a concise ‘cheat sheet’ of the commands needed to use Linux Access Control Lists (ACLs) with the setfacl and getfacl commands.

Show permissions for a file/directory
getfacl Afile Adirectory

Grant a user read, write and execute permissions to a file/directory
setfacl -m UserID:rwx Adirectory Afile

Grant a user only read and execute (traverse) permissions to a file (directory)
setfacl -m UserID:rx Adirectory Afile

Grant a user read, write and execute permissions, recursively, to a directory
setfacl -R -m UserID:rwx Adirectory

Set the default permissions for a directory, so that any new files or directories created in that directory inherit those permissions.
setfacl -m d:UserID:rwx Adirectory

Set both the permissions and default permissions at the same time.
setfacl -m UserID:rwx,d:UserID:rwx Adirectory

Remove the permissions for user
setfacl -x UserID Adirectory

Default permissions need to be removed separately, or you can combine the two
setfacl -x d:UserID Adirectory
setfacl -x UserID,d:UserID Adirectory

In a draconian fashion, cap the effective permissions of all named users, named groups and the owning group, ensuring none of them has write permission, by setting the ACL ‘mask’ to read and execute only
setfacl -m m::rx Adirectory Afile
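
Putting a few of these together, here is a minimal worked example (the user name alice and directory name Shared are just placeholders): create a directory, grant alice full access plus matching default permissions so new content inherits them, then inspect the result
# 'alice' and 'Shared' are example names only
mkdir Shared
setfacl -m alice:rwx,d:alice:rwx Shared
getfacl Shared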

Mac Goodness (3)

Follow-up to Mac Goodness from the NISOx blog.

15. More Terminal Bliss: Killing Terminal’s Quit

As should be clear from my other entries, I depend on the Mac Terminal program extensively, often having 20 or more windows/tabs open at once. So you can understand the horror when you accidentally hit CMD-Q and all those precious shell sessions die!

The solution was to remove the CMD-Q shortcut, with this gem:

defaults write com.apple.Terminal NSUserKeyEquivalents -dict-add "Quit Terminal" nil
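
To check what’s currently set, or to restore the default CMD-Q behaviour later, the standard defaults read/delete commands apply (you may need to restart Terminal for changes to take effect):

defaults read com.apple.Terminal NSUserKeyEquivalents
defaults delete com.apple.Terminal NSUserKeyEquivalents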

Mac Goodness (2)

Follow-up to Mac Goodness from the NISOx blog.

14. Maximising Terminal Bliss in Mavericks.

For some reason, with the Mavericks update, Terminal now copies text to the clipboard with the formatting of your current terminal font/colors, etc. This is absurd, because the terminal is text, plain and simple. Moreover, I often use the terminal as a clipboard, pasting in text just to get rid of its formatting.

Thankfully, I found this defaults mod to change Terminal’s behaviour, so that when you copy to the clipboard only the text goes across, with no formatting:

defaults write com.apple.Terminal CopyAttributesProfile com.apple.Terminal.no-attributes
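
And if you ever want the old behaviour back, deleting the key restores Terminal’s default:

defaults delete com.apple.Terminal CopyAttributesProfile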