Errata for Software Comparison

In 2019 we published Exploring the impact of analysis software on task fMRI results in Human Brain Mapping, showing how the choice of software package used for analyzing task fMRI data can influence the final results of a study. We reanalyzed data from three published task fMRI studies: Schonberg et al., 2012 (referred to as ‘ds000001’ in the article); Moran et al., 2012 (‘ds000109’); Padmanabhan et al., 2011 (‘ds000120’); reproducing the main group-level contrast maps from each publication within the three main fMRI software packages: AFNI, FSL, and SPM. We then applied a variety of quantitative and qualitative comparison methods to assess the similarity of the statistical maps between the three software packages.

We have recently become aware that five out of the fourteen analysis results used in the article contained errors.

The first two of these erroneous results were our AFNI nonparametric reanalyses of the Schonberg et al., 2012 (ds000001) and Moran et al., 2012 (ds000109) datasets. In both cases, the wrong sub-bricks of the 4D subject-level results files had been specified in the group-level permutation test model, meaning that permutation tests were carried out on the subject-level statistic images rather than the subject-level parameter estimate images as was intended. A similar problem was also found for our AFNI parametric analysis of the Padmanabhan et al., 2011 (ds000120) dataset, where again, the subject-level statistic images were wrongly entered into the group-level mixed-effects analysis rather than the intended parameter estimate images. Corrections to these three sets of results have led to minor changes in the quantitative comparisons that were originally reported. Most notably, the within-software Dice coefficients comparing the thresholded parametric and nonparametric results for AFNI are now slightly worse; for example, the AFNI parametric/nonparametric Dice coefficient for ds000001 has decreased from 0.833 to 0.700, and the corresponding Dice coefficient for ds000109 has decreased from 0.899 to 0.819.
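
For anyone wanting to guard against the same slip, AFNI’s 3dinfo can list the sub-brick labels of a subject-level stats dataset before it is handed to a group-level program. This is only a hedged sketch: the file names, label names and the use of 3dttest++ below are hypothetical (not necessarily the model used in the article); the point is simply to pass the parameter estimate (Coef) sub-brick, not the statistic (Tstat) sub-brick.

# List the sub-brick labels of a (hypothetical) subject-level stats file
3dinfo -label stats.sub-01+tlrc
# e.g. Full_Fstat|gain#0_Coef|gain#0_Tstat|...

# Select the Coef sub-brick explicitly when building the group-level test
3dttest++ -prefix group_gain \
    -setA "stats.sub-01+tlrc[gain#0_Coef]" \
          "stats.sub-02+tlrc[gain#0_Coef]"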

The final two results were our FSL parametric and nonparametric reanalyses of the Moran et al., 2012 (ds000109) dataset. In both cases, the linear model contrast had been incorrectly specified; the FSL parametric and permutation results in the article were for the False Photo Question vs False Belief Question contrast rather than the intended False Photo Story vs False Belief Story contrast used in the original publication. Correcting these analyses has led to notable improvements in the ds000109 AFNI-FSL and FSL-SPM inter-software comparisons, in the form of higher correlations between the unthresholded statistical maps and larger Dice coefficients for comparisons of the thresholded statistical maps. Specifically, correlations now range from 0.429 to 0.896 (was 0.429 to 0.747), and Dice coefficients range from 0 to 0.769 (was 0 to 0.684) for between‐software comparisons.
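
A quick sanity check against this kind of slip (again only a hedged sketch; the path and contrast names here are hypothetical) is to inspect the contrast names recorded in the lower-level FEAT design files before the COPEs are carried up to the group level:

# Confirm which contrasts the first-level FEAT analysis actually defined
grep ContrastName sub-01_task.feat/design.con
# /ContrastName1  false photo story vs false belief story
# /ContrastName2  false photo question vs false belief question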

In response to finding these errors, alongside reanalyzing the datasets, we have produced an erratum in HBM (https://doi.org/10.1002/hbm.25302) and uploaded a new preprint to our OSF repository that integrates the corrected results: https://osf.io/7ze9w/. In addition to this, we have created a new release on the GitHub repository connected to this project that includes all the code for the corrected analyses (Release Software_Comparison_0.9.0, https://github.com/NISOx-BDI/Software_Comparison/releases/tag/0.9.0), and made new NeuroVault collections containing the corrected statistical maps (ds000001: https://neurovault.org/collections/8447/, ds000109: https://neurovault.org/collections/7782/, ds000120: https://neurovault.org/collections/8468/). We have also updated the Jupyter Notebooks originally provided with this work so that users can recreate the amended figures obtained with these new results.

We apologize for any inconvenience these errors have caused. The management of the model specifications across the three software packages and three studies used in this investigation required extensive scripting. While we take full responsibility for these mistakes, they emphasize the need for a software-independent approach to model specification, and in this regard, we would like to highlight the BIDS Extension Proposal 002 (BEP002): The BIDS Stats-Models Specification as one possible way forward. A standardized way to represent fMRI models across neuroimaging software would reduce the risk of analysis configuration errors and facilitate comparative and multiverse analyses like this one.

Alex Bowring, Camille Maumet, Thomas E. Nichols

Errata for Cluster failure

Follow-up to Bibliometrics of Cluster Inference from the NISOx blog.

Given the widespread misinterpretation of our paper, Eklund et al., Cluster Failure: Why fMRI inferences for spatial extent have inflated false-positive rates, we filed an errata with the PNAS Editorial office:

Errata for Eklund et al., Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates.
Eklund, Anders; Nichols, Thomas E; Knutsson, Hans

Two sentences were poorly worded and could easily be misunderstood as overstating our results.

The last sentence of the Significance statement should read: “These results question the validity of a number of fMRI studies and may have a large impact on the interpretation of weakly significant neuroimaging results.”

The first sentence after the heading “The future of fMRI” should have read: “Due to lamentable archiving and data-sharing practices it is unlikely that problematic analyses can be redone.”

These replace the two sentences that mistakenly implied that our work affected all 40,000 publications (see Bibliometrics of Cluster Inference for a guesstimate of how much of the literature is potentially affected).

After initially declining the errata, on the grounds that it was correcting interpretation and not fact, PNAS have agreed to publish it:

  • Correction for Eklund et al., Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates, PNAS 2016 0:1612033113v1-201612033; doi:10.1073/pnas.1612033113.

The online PDF of the original article now reflects these corrections.

Update.

Revised to reflect PNAS acceptance of errata. 19 July 2016.
Revised to reference PNAS correction. 16 August 2016.

Bibliometrics of Cluster Inference

Writing about web page http://www.pnas.org/content/early/2016/06/27/1602413113.full

In Eklund et al., Cluster Failure: Why fMRI inferences for spatial extent have inflated false-positive rates, we report on a massive evaluation of statistical methods for task fMRI using resting state as a source of null data. We found very poor performance with one variant of cluster size inference, with a cluster defining threshold (CDT) of P=0.01 giving familywise error (FWE) rates of 50% or more when 5% was expected; a CDT of P=0.001 was much better, sometimes within the confidence bounds of our evaluation but almost always above 5% (i.e. suggesting a bias). We subsequently delved into the reasons for these inaccuracies, finding heavy-tailed spatial correlations and spatially-varying smoothness to be the likely suspects. In contrast, non-parametric permutation performs ‘spot on’, only having some inaccuracies in one-sample t-tests, likely due to an asymmetric distribution of the errors.

I’ve worked on comparisons of parametric and nonparametric inference methods for neuroimaging essentially my entire career (see refs in Eklund et al.). I’m especially pleased with this work for (1) the use of a massive library of resting-state fMRI for null realisations, finally having samples that reflect the complex spatial and temporal dependence structure of real data, and (2) getting to the bottom of why the parametric methods don’t work.

However, there is one number I regret: 40,000. In trying to refer to the importance of the fMRI discipline, we used an estimate of the entire fMRI literature as the number of studies impinged upon by our findings. In our defense, we found problems with cluster size inference in general (severe for P=0.01 CDT, biased for P=0.001), the dominant inference method, suggesting the majority of the literature was affected. The number in the impact statement, however, has been picked up by the popular press and fed a small twitterstorm. Hence, I feel it’s my duty to make at least a rough estimate of “How many articles does our work affect?”. I’m not a bibliometrician, and this is really a rough-and-ready exercise, but it hopefully gives a sense of the order of magnitude of the problem.

The analysis code (in Matlab) is laid out below, but here is the skinny: Based on some reasonable probabilistic computations, but perhaps fragile samples of the literature, I estimate about 15,000 papers use cluster size inference with correction for multiple testing; of these, around 3,500 use a CDT of P=0.01 or more liberal. 3,500 is about 9% of the entire literature, or perhaps more usefully, 11% of papers containing original data. (Of course some of these 15,000 or 3,500 might use nonparametric inference, but it’s unfortunately rare for fMRI; in contrast, it’s the default inference tool for structural VBM/DTI analyses in FSL).

I frankly thought this number would be higher, but hadn’t realised how large a proportion of studies never used any sort of multiple testing correction. (Can’t have inflated corrected significances if you don’t correct!) These calculations suggest 13,000 papers used no multiple testing correction. Of course some of these may be using regions of interest or sub-volume analyses, but it’s a scant few (i.e. clinical-trial-style studies with a single outcome) that have absolutely no multiplicity at all. Our paper isn’t directly about this group, but for publications that used the folk multiple testing correction of uncorrected P<0.001 with a 10-voxel cluster extent, our paper shows this approach has familywise error rates well in excess of 50%.

So, are we saying 3,500 papers are “wrong”? It depends. Our results suggest CDT P=0.01 results have inflated P-values, but each study must be examined… if the effects are really strong, it likely doesn’t matter if the P-values are biased, and the scientific inference will remain unchanged. But if the effects are really weak, then the results might indeed be consistent with noise. And, what about those 13,000 papers with no correction, especially common in the earlier literature? No, they shouldn’t be discarded out of hand either, but a particularly jaded eye is needed for those works, especially when comparing them to new references with improved methodological standards.

My take homes from this exercise have been:

  • No matter what method you’re using, if you go to town on a P-value on the razor’s edge of P=0.05000, you lean heavily on the assumptions of your method, and any perturbation of the data (or slight failure of the assumptions) would likely give you a result on the other side of the boundary. (This is a truism for all of science, but particularly for neuroimaging where we invariably use hard thresholds.)
  • Meta-analysis is an essential tool to collate a population of studies, and can be used in this very setting when individual results have questionable reliability. In an ideal world, all studies, good and bad, would be published with full data sharing (see next 2 points), each clearly marked with their varying strengths of evidence (no correction < FDR voxel-wise < FDR cluster-wise < FWE cluster-wise < FWE voxel-wise). This rich pool of data, with no file drawer effects, could then be distilled with meta-analysis to see what effects stand up.
  • Complete reporting of results, i.e. filing of statistical maps in public repositories, must happen! If all published studies’ T maps were available, we could revisit each analysis (approximately at least). The discipline of neuroimaging is embarrassingly behind genetics and bioinformatics, where SNP-by-SNP or gene-by-gene results (effect size, T-value, P-values) are shared by default. This is not a “data sharing” issue, this is a transparency of results issue… that we’ve gone 25 years showing only bitmap JPEGs/TIFFs of rendered images or tables of peaks is shameful. I’m currently in discussions with journal editors to press for sharing of full, unthresholded statistic maps to become standard in neuroimaging.
  • Data sharing, must also come. With the actual full data, we could revisit each paper’s analysis, exactly, and, what’s more, in 5 years, revisit again with even better methods, or for more insightful (e.g. predictive) analyses.

Update

The PNAS article has now been corrected; see the correction notice referenced above under Errata for Cluster failure.

The analysis code (Matlab) referred to above:

Nstudies=40000;   % N studies in PubMed with "fMRI" in title/abstract [1]
Pdata=0.80;       % Prop. of studies actually containing data [2]
Pcorr=0.59;       % Prop. of studies correcting for multiple testing, among data studies [3]
Pclus=0.79;       % Prop. of cluster inference studies, among corrected studies [4]
Pgte01=0.24;      % Prop. of cluster-forming thresholds 0.01 or larger, among corrected cluster inference studies [5]

% Number of studies using corrected cluster inference
Nstudies*Pdata*Pcorr*Pclus        % 14,915

% Number of studies using corrected cluster inference with a cluster defining threshold of P=0.01 or larger (more liberal)
Nstudies*Pdata*Pcorr*Pclus*Pgte01 %  3,579

% Number of studies with original data not using a correction for multiple testing
Nstudies*Pdata*(1-Pcorr)          % 13,120 

Notes

  1. 42,158 rounded down, from a Pubmed search for “((fmri[title/abstract] OR functional MRI[title/abstract]) OR functional Magnetic Resonance Imaging[title/abstract])” conducted 5 July 2016.
  2. Carp, 2012, literature 2007 – 2012, same search as in [1], with additional constraints, from which a random sample of 300 was selected. Of these 300, 59 were excluded as not presenting original fMRI data, giving (300-59)/300=0.80.
  3. Carp, 2012, “Although a majority of studies (59%) reported the use of some variety of correction for multiple comparisons, a substantial minority did not.”
  4. Woo et al., 2014, papers Jan 2010 – Nov 2011, “fMRI” and “threshold”, 1500 papers screened, 814 included; of those 814, 607 used cluster thresholding, and 607/814=0.746 matching 75% in Fig 1. However, Fig 1 also shows that 6% of those studies had no correction. To match up with Carp’s use of corrected statistics, we thus revise this to 607/(814-0.06*814)=0.79
  5. Woo et al., data from author (below), cross-classifying 480 studies with sufficient detail to determine the cluster defining threshold. There are 35 studies with a CDT P>0.01, 80 using CDT P=0.01, giving 115/480=0.240.
            AFNI     BV    FSL    SPM   OTHERS
            ____     __    ___    ___   ______

    >.01      9       5     9       8    4     
    .01       9       4    44      20    3     
    .005     24       6     1      48    3     
    .001     13      20    11     206    5     
    <.001     2       5     3      16    2     

Revisions

Revised to add mention of the slice of the literature that actually does use nonparametric inference (thanks to Nathaniel Daw).

Initially failed to adjust for final comment in note #4; correction increased all the cluster numbers slightly (thanks to Alan Pickering).

Added a passage about possibility of clinical trial type studies, with truly no multiplicity.

Added take-home about meta-analysis.

Revised to reference PNAS correction. 16 August 2016.

Unix find tricks

I will complete a proper, reasoned, motivated blog entry on the find command one day, but for now this just records the essential invocations for some (seemingly simple) tasks.


Find all files in the current directory but ignore any directories (and their children) called .git:

find . -name .git -prune -o -type f -print

And how I originally used this, show me the word count for each such file:

find . -name .git -prune -o -type f -exec wc {} \;


Count lines, but only for text files:

find . -type f -exec grep -Iq . {} \; -and -exec wc {} \;
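
The two ideas can be combined; for example (a sketch, assuming GNU find/grep as above), skip .git directories and line-count only the text files that remain:

find . -name .git -prune -o -type f -exec grep -Iq . {} \; -exec wc -l {} \;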


Bulk histograms with MassHist.m

When doing a quality control check of image data, one useful thing to look at is the histogram of the image intensities. When presented with masses of images, it’s useful to look at one histogram for each image in a big panel.

This MassHist.m script (for SPM) will print to PDF a ‘gang’ of histograms for all the images provided. The layout has 2 columns and 5 rows on a portrait-oriented page, but these are easily changed in the code.

MassHist.m

Linux Access Control Lists

Call me a dinosaur, but I really liked the ability of the Andrew File System to give fine-tuned control of access to directories. It allowed you to grant other users administrator rights, as well as rights to only insert files, or only delete files, etc. However, I digress.

The purpose of this post is to provide a concise ‘cheat sheet’ of the commands needed to use Linux Access Control Lists (ACLs) via setfacl and getfacl.

Show permissions for a file/directory
getfacl Afile Adirectory

Grant a user read, write and execute permissions to a file/directory
setfacl -m UserID:rwx Adirectory Afile

Grant a user only read and execute (traverse) permissions to a file (directory)
setfacl -m UserID:rx Adirectory Afile

Grant a user read, write and execute permissions, recursively, to a directory
setfacl -R -m UserID:rwx Adirectory

Set the default permissions for a directory, so that any new files or directories created in that directory inherit those permissions.
setfacl -m d:UserID:rwx Adirectory

Set both the permissions and default permissions at the same time.
setfacl -m UserID:rwx,d:UserID:rwx Adirectory

Remove the permissions for a user
setfacl -x UserID Adirectory

Remove the default permissions for a user (this must be done separately, or combined with the ordinary removal)
setfacl -x d:UserID Adirectory
setfacl -x UserID,d:UserID Adirectory

In a draconian fashion, over-rule the permissions granted to all named users and groups on this file/directory, ensuring none of them has an effective write permission (set the ‘mask’)
setfacl -m m::rx Adirectory Afile
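
To double-check the result of any of the above, getfacl shows the entries just set; for example (the user and ownership names here are of course hypothetical):

setfacl -m alice:rwx Adirectory
getfacl Adirectory
# file: Adirectory
# owner: tom
# group: stats
user::rwx
user:alice:rwx
group::r-x
mask::rwx
other::---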

Mac Goodness (3)

Follow-up to Mac Goodness from the NISOx blog.

#15 More Terminal Bliss: Killing Terminal’s Quit

As should be clear from my other entries, I depend on the Mac Terminal program extensively, often having 20 or more windows/tabs open at once. So you can understand the horror when you accidentally hit CMD-Q and all those precious shell sessions die!

The solution was to remove the CMD-Q shortcut, with this gem:

defaults write com.apple.Terminal NSUserKeyEquivalents -dict-add "Quit Terminal" nil
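
And if you ever want the shortcut back, deleting the custom key-equivalents dictionary restores the default (a blunt undo: this removes any other Terminal shortcut overrides you may have added too):

defaults delete com.apple.Terminal NSUserKeyEquivalents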

Mac Goodness (2)

Follow-up to Mac Goodness from the NISOx blog.

14. Maximising Terminal Bliss in Mavericks.

For some reason with the Mavericks update, the terminal now copies text to the clipboard with the format of your current terminal font/colors, etc. This is absurd, because the terminal is text, plain and simple. Moreover, I often use the terminal as a clipboard, pasting in some text to get rid of some formatting.

Thankfully I found this defaults mod to change Terminal’s behaviour, so that when you copy to the clipboard all that goes is the text and no formatting:

defaults write com.apple.Terminal CopyAttributesProfile com.apple.Terminal.no-attributes

Apple Notes Backup

Apple products are amazing when they work as intended. They are so well designed that they become integrated into your life and then you become dependent upon them. So when they fail to work, it is a source of enormous frustration.

Having lost the ability to sync Notes when iOS 7 came out (using my not-quite-newest iPhone 4S), when Notes functionality returned I suddenly had a growing number of duplicate copies of essential (i.e. most frequently accessed & edited) note items. I eventually figured out some were “On My Mac”, but I couldn’t work out what differed between the other versions until I could run diff on the command line, or ediff in Emacs. Here is the solution I found…

  1. Prepare Notes. The back-up procedure described next saves each note by its title, and notes with identical titles are overwritten. Of course the source of the trouble is the duplicates, so first carefully find the duplicates and give them unique names (e.g. by appending the last modification time, as shown in Apple Notes).
  2. Back up your Notes with Notes Exporter, a Mac application written to complement the note-taking app Write. Notes Exporter creates a directory filled with text files, one per note. (Note: previously I had recommended Notes Export, which exported to HTML, but as of March 2016 it is no longer available.)
  3. Convert the backed-up HTML files to TXT (or other formats, like RTF); this step is only needed if your export produced HTML rather than plain text. I never knew about textutil, a Mac utility that converts between document file formats. Run it as
    textutil -convert txt *.html
  4. Then run diff between the duplicates and see what’s changed (example below). In my case, many of the duplicates were just that, exact duplicates, but some reflected changes that I would have lost.
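
For example, with hypothetical note titles:

diff "Shopping list.txt" "Shopping list 2016-03-01.txt"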

Hope this helps someone!

Random (but amazing) bash shell tricks

This is just a random collection of bash shell tricks I need to collect in one place. I hope you find them useful.

Process Substitution

When a program expects filename arguments but what you have are streams rather than files, “process substitution” lets you replace each filename with a command to be executed, whose output is fed to the program as if it were a file.

For example, if you wanted to extract the first column from one CSV table, and concatenate it (row-by-row) with the last column from a second CSV table:

paste -d , <( awk -F, '{print $1}' table1.csv ) <( awk -F, '{print $NF}' table2.csv )

While loops with read

To process input a line at a time, this construct can come in handy.

cat | while read CMD ; do
    echo "$CMD"
done

Beware, though, that the commands in the loop are in a sub-shell, and so any variables you set there cannot be accessed later in the script.
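
One way around the sub-shell issue (a minimal sketch) is to feed the loop via redirection instead of a pipe, e.g. from a file or from process substitution as above; the loop then runs in the current shell and its variables survive:

# Count lines of input without losing the counter to a sub-shell
N=0
while read -r LINE ; do
    N=$((N+1))
done < <( ls )
echo "Counted $N lines"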