Errata for Cluster failure

Follow-up to Bibliometrics of Cluster Inference from the NISOx blog.

Given the widespread misinterpretation of our paper, Eklund et al., Cluster Failure: Why fMRI inferences for spatial extent have inflated false-positive rates, we filed an errata with the PNAS Editorial office:

Errata for Eklund et al., Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates.
Eklund, Anders; Nichols, Thomas E; Knutsson, Hans

Two sentences were poorly worded and could easily be misunderstood as overstating our results.

The last sentence of the Significance statement should read: “These results question the validity of a number of fMRI studies and may have a large impact on the interpretation of weakly significant neuroimaging results.”

The first sentence after the heading “The future of fMRI” should have read: “Due to lamentable archiving and data-sharing practices it is unlikely that problematic analyses can be redone.”

These replace the two sentences that mistakenly implied that our work affected all 40,000 publications (see Bibliometrics of Cluster Inference for a guesstimate of how much of the literature is potentially affected).

After initially declining the errata, on the grounds that it was correcting interpretation and not fact, PNAS have agreed to publish it:

  • Correction for Eklund et al., Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates, PNAS 2016 0:1612033113v1-201612033; doi:10.1073/pnas.1612033113.

The online PDF of the original article now reflects these corrections.

Update.

Revised to reflect PNAS acceptance of errata. 19 July 2016.
Revised to reference PNAS correction. 16 August 2016.

Bibliometrics of Cluster Inference

Writing about web page http://www.pnas.org/content/early/2016/06/27/1602413113.full

In Eklund et al., Cluster Failure: Why fMRI inferences for spatial extent have inflated false-positive rates, we report on a massive evaluation of statistical methods for task fMRI using resting-state fMRI as a source of null data. We found very poor performance with one variant of cluster size inference, with a cluster defining threshold (CDT) of P=0.01 giving familywise error (FWE) rates of 50% or more when 5% was expected; a CDT of P=0.001 was much better, sometimes within the confidence bounds of our evaluation but almost always above 5% (i.e. suggesting a bias). We subsequently delve into the reasons for these inaccuracies, finding heavy-tailed spatial correlations and spatially-varying smoothness to be the likely suspects. In contrast, non-parametric permutation performs ‘spot on’, only having some inaccuracies in one-sample t-tests, likely due to asymmetric distribution of the errors.

I’ve worked on comparisons of parametric and nonparametric inference methods for neuroimaging essentially my entire career (see refs in Eklund et al.). I’m especially pleased with this work for (1) the use of a massive library of resting-state fMRI for null realisations, finally having samples that reflect the complex spatial and temporal dependence structure of real data, and (2) getting to the bottom of why the parametric methods don’t work.

However, there is one number I regret: 40,000. In trying to convey the importance of the fMRI discipline, we used an estimate of the entire fMRI literature as the number of studies impinged by our findings. In our defense, we found problems with cluster size inference in general (severe for P=0.01 CDT, biased for P=0.001), the dominant inference method, suggesting the majority of the literature was affected. The number in the impact statement, however, has been picked up by the popular press and has fed a small twitterstorm. Hence, I feel it’s my duty to make at least a rough estimate of “How many articles does our work affect?”. I’m not a bibliometrician, and this is really a rough-and-ready exercise, but it hopefully gives a sense of the order of magnitude of the problem.

The analysis code (in Matlab) is laid out below, but here is the skinny: Based on some reasonable probabilistic computations, but perhaps fragile samples of the literature, I estimate about 15,000 papers use cluster size inference with correction for multiple testing; of these, around 3,500 use a CDT of P=0.01 or more lenient. 3,500 is about 9% of the entire literature, or perhaps more usefully, 11% of papers containing original data. (Of course some of these 15,000 or 3,500 might use nonparametric inference, but it’s unfortunately rare for fMRI; in contrast, it’s the default inference tool for structural VBM/DTI analyses in FSL).

I frankly thought this number would be higher, but didn’t realise the large proportion of studies that never used any sort of multiple testing correction. (Can’t have inflated corrected significances if you don’t correct!). These calculations suggest 13,000 papers used no multiple testing correction. Of course some of these may be using regions of interest or sub-volume analyses, but it’s a scant few (i.e. clinical-trial-style studies with a single outcome) that have absolutely no multiplicity at all. Our paper isn’t directly about this group, but for publications that used the folk multiple testing correction (an uncorrected P=0.001 threshold with a 10-voxel extent), our paper shows this approach has familywise error rates well in excess of 50%.

So, are we saying 3,500 papers are “wrong”? It depends. Our results suggest CDT P=0.01 results have inflated P-values, but each study must be examined… if the effects are really strong, it likely doesn’t matter if the P-values are biased, and the scientific inference will remain unchanged. But if the effects are really weak, then the results might indeed be consistent with noise. And, what about those 13,000 papers with no correction, especially common in the earlier literature? No, they shouldn’t be discarded out of hand either, but a particularly jaded eye is needed for those works, especially when comparing them to new references with improved methodological standards.

My take homes from this exercise have been:

  • No matter what method you’re using, if you go to town on a P-value on the razor’s edge of P=0.05000, you lean heavily on the assumptions of your method, and any perturbation of the data (or slight failure of the assumptions) would likely give you a result on the other side of the boundary. (This is a truism for all of science, but particularly for neuroimaging where we invariably use hard thresholds.)
  • Meta-analysis is an essential tool to collate a population of studies, and can be used in this very setting when individual results have questionable reliability. In an ideal world, all studies, good and bad, would be published with full data sharing (see next 2 points), each clearly marked with their varying strengths of evidence (no correction < FDR voxel-wise < FDR cluster-wise < FWE cluster-wise < FWE voxel-wise). This rich pool of data, with no file drawer effects, could then be distilled with meta-analysis to see what effects stand up.
  • Complete reporting of results, i.e. filing of statistical maps in public repositories, must happen! If all published studies’ T maps were available, we could revisit each analysis (approximately at least). The discipline of neuroimaging is embarrassingly behind genetics and bioinformatics, where SNP-by-SNP or gene-by-gene results (effect size, T-value, P-values) are shared by default. This is not a “data sharing” issue; it is a transparency-of-results issue… that we’ve gone 25 years showing only bitmap JPEGs/TIFFs of rendered images or tables of peaks is shameful. I’m currently in discussions with journal editors to press for sharing of full, unthresholded statistic maps to become standard in neuroimaging.
  • Data sharing must also come. With the actual full data, we could revisit each paper’s analysis exactly and, what’s more, in 5 years revisit again with even better methods, or for more insightful (e.g. predictive) analyses.

Update

The PNAS article has now been corrected:

  • Correction for Eklund et al., Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates, PNAS 2016 0:1612033113v1-201612033; doi:10.1073/pnas.1612033113.

The analysis code:

Nstudies=40000;   % N studies in PubMed with "fMRI" in title/abstract [1]
Pdata=0.80;       % Prop. of studies actually containing data [2]
Pcorr=0.59;       % Prop. of studies correcting for multiple testing, among data studies [3]
Pclus=0.79;       % Prop. of cluster inference studies, among corrected studies [4]
Pgte01=0.24;      % Prop. of cluster-forming thresholds 0.01 or larger, among corrected cluster inference studies [5]

% Number of studies using corrected cluster inference
Nstudies*Pdata*Pcorr*Pclus        % 14,915

% Number of studies using corrected cluster inference with a cluster defining threshold of P=0.01 or larger (i.e. more lenient)
Nstudies*Pdata*Pcorr*Pclus*Pgte01 %  3,579

% Number of studies with original data not using a correction for multiple testing
Nstudies*Pdata*(1-Pcorr)          % 13,120 
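
For completeness, two extra lines (not in the original listing) reproduce the percentages quoted above:

% Fraction of the whole literature, and of papers with original data
Nstudies*Pdata*Pcorr*Pclus*Pgte01/Nstudies          % 0.089, i.e. ~9%
Nstudies*Pdata*Pcorr*Pclus*Pgte01/(Nstudies*Pdata)  % 0.112, i.e. ~11%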

Notes

  1. 42,158 rounded down, from a Pubmed search for “((fmri[title/abstract] OR functional MRI[title/abstract]) OR functional Magnetic Resonance Imaging[title/abstract])” conducted 5 July 2016.
  2. Carp, 2012, literature 2007 – 2012, same search as in [1], with additional constraints, from which a random sample of 300 was selected. Of these 300, 59 were excluded as not presenting original fMRI data; (300-59)/300=0.80.
  3. Carp, 2012, “Although a majority of studies (59%) reported the use of some variety of correction for multiple comparisons, a substantial minority did not.”
  4. Woo et al., 2014, papers Jan 2010 – Nov 2011, “fMRI” and “threshold”, 1500 papers screened, 814 included; of those 814, 607 used cluster thresholding, and 607/814=0.746, matching the 75% in Fig 1. However, Fig 1 also shows that 6% of those studies had no correction. To match up with Carp’s use of corrected statistics, we thus revise this to 607/(814-0.06*814)=0.79.
  5. Woo et al., data from author (below), cross-classifying 480 studies with sufficient detail to determine the cluster defining threshold. There are 35 studies with a CDT P>0.01, 80 using CDT P=0.01, giving 115/480=0.240.
    CDT      AFNI    BV    FSL    SPM    OTHERS
    >.01        9     5      9      8       4
    .01         9     4     44     20       3
    .005       24     6      1     48       3
    .001       13    20     11    206       5
    <.001       2     5      3     16       2

Revisions

Revised to add mention of the slice of the literature that actually does use nonparametric inference (thanks to Nathaniel Daw).

Initially failed to adjust for final comment in note #4; correction increased all the cluster numbers slightly (thanks to Alan Pickering).

Added a passage about possibility of clinical trial type studies, with truly no multiplicity.

Added take-home about meta-analysis.

Revised to reference PNAS correction. 16 August 2016.

Unix find tricks

I will complete a proper, reasoned, motivated blog entry on the find command one day, but for now this just records the essential invocations for some (seemingly simple) tasks.


Find all files in the current directory but ignore any directories (and their children) called .git; the -prune stops find descending into them, and the -print after the -o applies only to everything else:

find . -name .git -prune -o -type f -print

And, as I originally used this, show the word count for each such file:

find . -name .git -prune -o -type f -exec wc {} \;


Count lines (and words and bytes), but only for text files; the grep -Iq . test succeeds only for non-binary files, so binaries are skipped:

find . -type f -exec grep -Iq . {} \; -and -exec wc {} \;


Bulk histograms with MassHist.m

When doing a quality control check of image data, one useful thing to look at is the histogram of the image intensities. When presented with masses of images, it’s useful to look at one histogram for each image in a big panel.

This MassHist.m script (for SPM) will print to PDF a ‘gang’ of histograms for all the images provided. The layout has 2 columns and 5 rows on a portrait-oriented page, but these are easily changed in the code.

MassHist.m
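
If you just want the gist without downloading the script, the core idea is only a few lines of SPM/Matlab. This is a minimal sketch, not the actual MassHist.m; the filename, bin count and single-page limit are illustrative only:

% Minimal sketch: one portrait page of histograms, 2 columns x 5 rows
imgs = spm_select(Inf,'image','Select images to check');   % up to 10 images fit on one page
figure('PaperOrientation','portrait')
for i = 1:min(size(imgs,1),10)
  Y = spm_read_vols(spm_vol(deblank(imgs(i,:))));           % read the image data
  subplot(5,2,i)
  hist(Y(isfinite(Y)),100)                                   % 100-bin intensity histogram
  title(spm_file(imgs(i,:),'basename'),'Interpreter','none')
end
print('-dpdf','MassHist_sketch.pdf')                         % write the panel to PDF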

Linux Access Control Lists

Call me a dinosaur, but I really liked the ability of the Andrew File System to give fine-tuned control of access to directories. It allowed you to grant other users administrator rights, as well as rights to only insert files, or only delete files, etc. However, I digress.

The purpose of this post is to provide a concise ‘cheat sheet’ of the commands needed to use Linux Access Control Lists (ACLs) with the setfacl and getfacl commands.

Show permissions for a file/directory
getfacl Afile Adirectory

Grant a user read, write and execute permissions to a file/directory
setfacl -m UserID:rwx Adirectory Afile

Grant a user only read and execute (traverse) permissions to a file (directory)
setfacl -m UserID:rx Adirectory Afile

Grant a user read, write and execute permissions, recursively, to a directory
setfacl -R -m UserID:rwx Adirectory

Set the default permissions for a directory, so that any new files or directories created in that directory inherit those permissions.
setfacl -m d:UserID:rwx Adirectory

Set both the permissions and default permissions at the same time.
setfacl -m UserID:rwx,d:UserID:rwx Adirectory

Remove the permissions for a user
setfacl -x UserID Adirectory

Default permissions need to be removed separately, or the two can be combined:
setfacl -x d:UserID Adirectory
setfacl -x UserID,d:UserID Adirectory

In a draconian fashion, over-rule the permissions granted to named users and groups on this file/directory, ensuring none of them has write permission (i.e. set the ‘mask’, which caps their effective permissions)
setfacl -m m::rx Adirectory Afile

Mac Goodness (3)

Follow-up to Mac Goodness from the NISOx blog.

#15 More Terminal Bliss: Killing Terminal’s Quit

As should be clear from my other entries, I depend on the Mac Terminal program extensively, often having 20 or more windows/tabs open at once. So you can understand the horror when you accidentally hit CMD-Q and all those precious shell sessions die!

The solution was to remove the CMD-Q shortcut, with this gem:

defaults write com.apple.Terminal NSUserKeyEquivalents -dict-add "Quit Terminal" nil

Mac Goodness (2)

Follow-up to Mac Goodness from the NISOx blog.

14. Maximising Terminal Bliss in Mavericks.

For some reason with the Mavericks update, the terminal now copies text to the clipboard with the format of your current terminal font/colors, etc. This is absurd, because the terminal is text, plain and simple. Moreover, I often use the terminal as a clipboard, pasting in some text to get rid of some formatting.

Thankfully I found this defaults mod to change Terminal’s behaviour, so that when you copy to the clipboard all that goes is the text and no formatting:

defaults write com.apple.Terminal CopyAttributesProfile com.apple.Terminal.no-attributes

Apple Notes Backup

Apple products are amazing when they work as intended. They are so well designed that they become integrated into your life, and then you become dependent upon them. So when they fail to work it is a source of enormous frustration.

Having lost the ability to sync Notes when iOS 7 came out (using my not-newest iPhone 4S), when Notes functionality returned I suddenly had a growing number of duplicate copies of essential (i.e. most frequently accessed & edited) note items. I eventually figured out some were “On My Mac”, but I couldn’t work out what differed between the other versions until I could run diff on the command line, or ediff in emacs. Here is the solution I found…

  1. Prepare Notes. The back-up procedure described next saves each note by its title, and notes with identical titles are overwritten. Of course the source of the trouble is the duplicates, so first carefully find the duplicates and give them unique names (e.g. by appending the last modification time, as shown in Apple Notes).
  2. Back up your Notes with Notes Exporter, a Mac application written to complement the notes app Write. Notes Exporter creates a directory filled with text files, one per note. (Note, previously I had recommended Notes Export for HTML export, but as of March 2016 it is no longer available.)
  3. Convert backed up HTML files to TXT (or other formats, like RTF). I never knew about textutil, a Mac utility that converts between document file formats. Run it as
    textutil -convert txt *.html
  4. Then run diff between the duplicates (see the example below) and see what’s changed. In my case, many of the duplicates were just that, exact duplicates, but some reflected changes that I would have lost.
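
To make step 4 concrete, this is the sort of comparison I mean (the note titles are made up; use your own renamed duplicates):

diff -u "Passwords (2016-01-03).txt" "Passwords (2016-03-20).txt"

or, from within emacs, M-x ediff-files with the same two files.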

Hope this helps someone!

Random (but amazing) bash shell tricks

This is just a random collection of bash shell tricks I need to collect in one place. I hope you find them useful.

Process Substitution

When a program expects filename arguments but what you have are streams rather than files, “process substitution” lets you replace each filename with a command to be executed; the output of that command is fed to the program as if it were a file.

For example, if you wanted to extract the first column from one CSV table, and concatenate it (row-by-row) with the last column from a second CSV table:

paste -d , <( awk -F, '{print $1}' table1.csv) <( awk -F, '{print $NF}' table2.csv)

While loops with read

To process input a line at a time, this construct can come in handy.

cat | while read CMD ; do
    echo $CMD
done

Beware, though, that the commands in the loop are in a sub-shell, and so any variables you set there cannot be accessed later in the script.
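
A quick illustration of the gotcha and the usual workaround (somefile is just a placeholder):

# Lost: the pipe puts the while loop in a subshell, so n is still 0 afterwards
n=0
cat somefile | while read LINE ; do n=$((n+1)) ; done
echo $n     # prints 0

# Kept: redirect into the loop (or use < <(command)) so it runs in the current shell
n=0
while read LINE ; do n=$((n+1)) ; done < somefile
echo $n     # prints the number of lines in somefile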

FWHM & RESEL details for SPM and FSL

Executive Summary

Random Field Theory depends on a measure called the RESEL count to determine corrected P-values. How RESELs are reported in SPM and FSL, however, differs. This post reviews the basics of RESELs and FWHM and details how to interpret the values reported by each package.

Background

Random Field Theory (RFT) is the method used in both FSL and SPM to make inference on statistic images. Specifically, it finds thresholds that control the chance of false positives while searching the brain for activations or group differences. The magic of RFT, unlike other methods such as Bonferroni, is that it accounts for the spatial smoothness of the data; as a result, a less severe correction is applied to highly smooth data while a more stringent correction is applied to relatively rough data.

The details of how RFT measures smoothness are technical (it is the inverse square root of the determinant of the variance-covariance matrix of the gradient of the component fields), but fortunately Keith Worsley introduced the notion of FWHM and RESELs [1] to make matters simpler.

For RFT, the FWHM is defined as the Full Width at Half Maximum of the Gaussian kernel required to smooth independent, white noise data to have the same “smoothness” as your data. Here, “smoothness” refers to the technical definition involving the “square root of the determinant…”. Crucially, FWHM is not the applied smoothness, e.g. the Gaussian kernel smoothing applied as part of preprocessing. It is the smoothness of the data fed into the GLM, which is a combination of the intrinsic smoothness of the data (affected by things like image reconstruction parameters and physiological noise) and any smoothing applied in preprocessing. As we have 3-D data, FWHM is usually represented as a 3-vector, [FWHMx FWHMy FWHMz], though be careful to check whether the units are in voxels or mm.

RESEL stands for RESolution ELement, and is a virtual voxel with dimensions [FWHMx FWHMy FWHMz]. The RESEL count is the number of RESELs that fit into your search volume; in math mode:

R = V / ( FWHMx × FWHMy × FWHMz )

where V is the search volume; note that V can be in voxels or mm3 (or whatever) as long as FWHMx, FWHMy & FWHMz have the same units.

Note a possible source of confusion: a “RESEL” is a 3D object, a cuboid; the “RESEL count” is a scalar, your search volume in units of RESELs.
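
A toy example (the numbers are made up purely for illustration): with an estimated FWHM of [9 9 9] mm on 3 mm voxels, i.e. 3 voxels in each direction, one RESEL is 3 × 3 × 3 = 27 voxels, and a search volume of 54,000 voxels gives a RESEL count of R = 54,000 / 27 = 2,000.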

SPM

SPM provides the most complete reporting of FWHM and RESEL-related quantities. In the Results figure, in the footer of the P-value table, SPM lists [FWHMx FWHMy FWHMz] in both voxel and mm units. It also lists the search volume in RESELs (along with the same in mm3 and voxel units). Finally, it reports the size of one RESEL measured in voxels.

Note, if you check SPM’s arithmetic, you may find that R is not exactly V / ( FWHMx × FWHMy × FWHMz ). The reason is that a more sophisticated method for measuring the search volume is used, one that involves counting voxel edges and faces, etc.; see spm/src/spm_resels_vol.c & [2] for more details.
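
If you prefer to pull these numbers out programmatically rather than reading the Results footer, they also live in the SPM.mat structure, at least in recent SPM versions; the field names below are from memory, so verify against your own SPM.mat:

load SPM.mat          % from the analysis directory
SPM.xVol.FWHM         % [FWHMx FWHMy FWHMz], in voxels
SPM.xVol.R            % resel counts; the last element is the RESEL count of the search volume
SPM.xVol.S            % search volume in voxels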

FSL

FSL provides no information about RESELs or FWHM in its HTML output. There is some information available, however, buried in the *.feat directories.

In each *.feat directory there is a stats subdirectory. In that directory you’ll find a file called smoothness that contains three lines, one each labeled DLH, VOLUME, RESELS.

DLH is the unintuitive parameter that describes image roughness; precisely it is

DLH = (4 log(2))^(3/2) / ( FWHMx × FWHMy × FWHMz )

where FWHMx, FWHMy and FWHMz are in voxel units.

VOLUME is the search volume in units of voxels. And, as a source of great possible confusion, RESELS is not the RESEL count, but rather, the size of one RESEL in voxel units

RESELS = FWHMx × FWHMy × FWHMz

Sadly, FWHMx, FWHMy and FWHMz are never individually saved. Note, however, that the geometric mean of the FWHMs is available as

AvgFWHM = RESELS^(1/3)

which is some consolation.
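
Both numbers are easy to pull out of the smoothness file with awk. A rough sketch, assuming the file really is three “LABEL value” lines as described above, and with myanalysis.feat standing in for your own analysis directory:

SMOOTHNESS=myanalysis.feat/stats/smoothness
RESELS=$(awk '$1=="RESELS" {print $2}' $SMOOTHNESS)
VOLUME=$(awk '$1=="VOLUME" {print $2}' $SMOOTHNESS)

# AvgFWHM = RESELS^(1/3) in voxels; RESEL count = VOLUME / RESELS
awk -v r=$RESELS -v v=$VOLUME 'BEGIN {printf "AvgFWHM %.2f voxels, RESEL count %.1f\n", r^(1/3), v/r}'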

Update: With the verbose option to smoothest, -V, the individual FWHMx, FWHMy & FWHMz are reported. But, again, this isn’t saved as part of the feat output.

Who cares?

Any neuroimaging researcher should always check the estimated FWHM of their analysis. The reason is that if, for some reason, the FWHM smoothness is less than 3 voxels in any dimension, the accuracy of the RFT results can be very poor [3]. While the FSL user can’t easily examine the 3 individual FWHM values, they can at least compute AvgFWHM and check this.

Another reason is to understand differences in corrected P-values. For example, say you conduct 2 studies each with 20 subjects. You notice that, despite having similar T-values, the two studies have quite different FWE corrected P-values. The difference must be due to different RESEL counts, which in turn can be explained by either a difference in search volume (V) or smoothness (FWHM).

Finally, one more reason to check the RESELs is when you are getting bizarre results, like insanely significant results for modest cluster sizes. For example, if VBM results are poorly masked, it can happen that the analysis includes a large boundary of air voxels outside the brain. Not only does this unnecessarily increase your search volume (possibly making your multiple testing correction more stringent), but the smoothness estimate in air will be dramatically lower than in brain tissue, and thus corrupt the accuracy of the inferences. A bizarre value for RESEL size or FWHM will reveal this problem.

Further Reading

For a light-touch introduction to Random Field Theory, see [3]; for more detail (but less than the mathematical papers) see the relevant chapters in these books [4,5,6].

References

[1] Worsley, K. J., Evans, A. C., Marrett, S., & Neelin, P. (1992). A three-dimensional statistical analysis for CBF activation studies in human brain. Journal of Cerebral Blood Flow and Metabolism, 12(6), 900–918. Preprint

[2] Worsley, K. J., Marrett, S., Neelin, P., Vandal, A. C., Friston, K. J., & Evans, A. C. (1996). A unified statistical approach for determining significant signals in images of cerebral activation. Human brain mapping, 4(1), 58–73.

[3] Nichols, T. E., & Hayasaka, S. (2003). Controlling the familywise error rate in functional neuroimaging: a comparative review. Statistical Methods in Medical Research, 12(5), 419–446.

[4] Poldrack, R. A., Mumford, J. A., & Nichols, T. E. (2011). Handbook of fMRI Data Analysis (p. 450). Cambridge University Press.

[5] Penny, W. D., Friston, K. J., Ashburner, J. T., Kiebel, S. J., & Nichols, T. E. (2006). Statistical Parametric Mapping: The Analysis of Functional Brain Images (p. 656). Academic Press. [Previous version available free]

[6] Jezzard, P., Matthews, P. M., & Smith, S. M. (2002). Functional MRI: An Introduction to Methods (p. 432). Oxford University Press.