A very classy reply from Karl Friston

April 25th, 2012 by Tal Yarkoni

After writing my last post critiquing Karl Friston’s commentary in NeuroImage, I emailed him the link, figuring he might want the opportunity to respond, and also to make sure he knew my commentary wasn’t intended as a personal attack (I have enormous respect for his seminal contributions to the field of neuroimaging). Here’s his very classy reply (posted with permission):

Many thanks for your kind e-mail and link to your blog. I thought your review and deconstruction of the issues were excellent and I concur with the points that you make.

You are absolutely right that I ignored the use of high (corrected) thresholds when controlling for multiple comparisons – and was focusing on the simple case of a single test. I also agree that, ideally, one would report confidence intervals on effect sizes – indeed the original version of my article concluded with this recommendation (now the last line of appendix 1). I remember – at the inception of SPM – discussing with Andrew Holmes the reporting of confidence intervals using statistical maps – however, the closest we ever got was posterior probability maps (PPM), many years later.

My agenda was probably a bit simpler than you might have supposed – it was to point out that significant p-values from small sample studies are valid and will – on average – detect effects whose sizes are bigger than the equivalent effects with larger sample sizes. I did not mean to imply that large studies are useless – although I do believe that unqualified reports of significant p-values from large sample sizes should be treated with caution. Although my agenda was fairly simple, the issues raised may well require more serious consideration – of the sort that you have offered. I submitted the article as a ‘comments and controversy’, anticipating that it would elicit a thoughtful response of the sort in your blog. If you have not done so already; you could prepare your blog for peer-reviewed submission – perhaps as a response to the ‘comments and controversy’ at NeuroImage?

I will not respond to your blog directly; largely because I have never blogged before and prefer to restrict myself to peer-reviewed formats. However, please feel free to use this e-mail in any way you see fit.

With very best wishes,

Karl

PS: although you may have difficulty believing it – all the critiques I caricatured I have actually seen in one form or another – even the retinotopic mapping critique!

Seeing as my optimistic thought when I sent Friston the link was “I hope he doesn’t eat me alive” (not because he has that kind of reputation, but because, frankly, if someone obnoxiously sent me a link to an abrasive article criticizing my work at length, I might not be very happy either), I was very happy to read that. I wrote back:

Thanks very much for your gracious reply–especially since the tone of my commentary was probably a bit abrasive. If I’m being honest with myself, I’m pretty sure I’d have a hard time setting my ego aside long enough to respond this constructively if someone criticized me like this (no matter how I felt about the substantive issues), so it’s very much appreciated.

I won’t take up any of the substantive issues here, since it sounds like we’re in reasonable agreement on most of the issues. As far as submitting a formal response to NeuroImage, I’d normally be happy to do that, but I’m currently boycotting Elsevier journals as part of the Cost of Knowledge campaign, and feel pretty strongly about that, so I won’t be submitting anything to NeuroImage for the foreseeable future. This isn’t meant as an indictment of the NeuroImage editorial board or staff in any way; it’s strictly out of frustration at Elsevier’s policies and recent actions.

Also, while I like the comments and controversies format at NeuroImage a lot, there’s no question that the process is considerably slower than what online communication affords. The reality is that by the time my comment ever came out (probably in a much abridged form), much of the community will have moved on and lost interest in the issue, and I’ve found in the past that the kind of interactive and rapid engagement it’s possible to get online is very hard to approximate in a print forum. But I can completely understand your hesitation to respond this way; it could quickly become unmanageable. For what it’s worth, I don’t really think blogs are the right medium for this kind of thing in the long term anyway, but until we get publisher-independent evaluation platforms that centralize the debate in one place (which I’m hopeful will happen relatively soon), I think they play a useful role.

Anyway, whatever your opinion of the original commentary and/or my post, I think Friston deserves a lot of credit for his response, which, I’ll just reiterate, is much more civil and tactful than mine would probably have been in his situation. I can’t think of many cases, either in print or online, where someone has responded so constructively to criticism.

One other thing I forgot to mention in my reply to Friston, but is worth bringing up here: I think SPM confidence interval maps would be a great idea! It would be fantastic if fMRI analysis packages by default produced 3 effect size maps for every analysis–respectively giving the observed, lower bound, and upper bound estimates of effect size at every voxel. This would naturally discourage researchers from making excessively strong claims (since one imagines almost everyone would at least glance at the lower-bound map) while providing reviewers a very easy way to frame concerns about power and sample size (“can the authors please present the confidence interval maps in the appendix?”). Anyone want to write an SPM plug-in to do this?

Sixteen is not magic: Comment on Friston (2012)

April 25th, 2012 by Tal Yarkoni

UPDATE: I’ve posted a very classy email response from Friston here.

In a “comments and controversies” piece published in NeuroImage last week, Karl Friston describes “Ten ironic rules for non-statistical reviewers”. As the title suggests, the piece is presented ironically; Friston frames it as a series of guidelines reviewers can follow in order to ensure successful rejection of any neuroimaging paper. But of course, Friston’s real goal is to convince you that the practices described in the commentary are bad ones, and that reviewers should stop picking on papers for such things as having too little power, not cross-validating results, and not being important enough to warrant publication.

Friston’s piece is, simultaneously, an entertaining satire of some lamentable reviewer practices, and—in my view, at least—a frustratingly misplaced commentary on the relationship between sample size, effect size, and inference in neuroimaging. While it’s easy to laugh at some of the examples Friston gives, many of the positions Friston presents and then skewers aren’t just humorous portrayals of common criticisms; they’re simply bad caricatures of comments that I suspect only a small fraction of reviewers ever make. Moreover, the cures Friston proposes—most notably, the recommendation that sample sizes on the order of 16 to 32 are just fine for neuroimaging studies—are, I’ll argue, much worse than the diseases he diagnoses.

Before taking up the objectionable parts of Friston’s commentary, I’ll just touch on the parts I don’t think are particularly problematic. Of the ten rules Friston discusses, seven seem palatable, if not always helpful:

  • Rule 6 seems reasonable; there does seem to be excessive concern about the violation of assumptions of standard parametric tests. It’s not that this type of thing isn’t worth worrying about at some point, just that there are usually much more egregious things to worry about, and it’s been demonstrated that the most common parametric tests are (relatively) insensitive to violations of normality under realistic conditions.
  • Rule 10 is also on point; given that we know the reliability of peer review is very low, it’s problematic when reviewers make the subjective assertion that a paper just isn’t important enough to be published in such-and-such journal, even as they accept that it’s technically sound. Subjective judgments about importance and innovation should be left to the community to decide. That’s the philosophy espoused by open-access venues like PLoS ONE and Frontiers, and I think it’s a good one.
  • Rules 7 and 9—criticizing a lack of validation or a failure to run certain procedures—aren’t wrong, but seem to me much too broad to support blanket pronouncements. Surely much of the time when reviewers highlight missing procedures, or complain about a lack of validation, there are perfectly good reasons for doing so. I don’t imagine Friston is really suggesting that reviewers should stop asking authors for more information or for additional controls when they think it’s appropriate, so it’s not clear what the point of including this here is. The example Friston gives in Rule 9 (of requesting retinotopic mapping in an olfactory study), while humorous, is so absurd as to be worthless as an indictment of actual reviewer practices. In fact, I suspect it’s so absurd precisely because anything less extreme Friston could have come up with would have caused readers to think, “but wait, that could actually be a reasonable concern…”
  • Rules 1, 2, and 3 seem reasonable as far as they go; it’s just common sense to avoid overconfidence, arguments from emotion, and tardiness. Still, I’m not sure what’s really accomplished by pointing this out; I doubt there are very many reviewers who will read Friston’s commentary and say “you know what, I’m an overconfident, emotional jerk, and I’m always late with my reviews–I never realized this before.” I suspect the people who fit that description—and for all I know, I may be one of them—will be nodding and chuckling along with everyone else.

This leaves Rules 4, 5, and 8, which, conveniently, all focus on a set of interrelated issues surrounding low power, effect size estimation, and sample size. Because Friston’s treatment of these issues strikes me as dangerously wrong, and liable to send a very bad message to the neuroimaging community, I’ve laid out some of these issues in considerably more detail than you might be interested in. If you just want the direct rebuttal, skip to the “Reprising the rules” section below; otherwise the next two sections sketch Friston’s argument for using small sample sizes in fMRI studies, and then describe some of the things wrong with it.

Friston’s argument

Friston’s argument is based on three central claims:

  1. Classical inference (i.e., the null hypothesis testing framework) suffers from a critical flaw, which is that the null is always false: no effects (at least in psychology) are ever truly zero. Collect enough data and you will always end up rejecting the null hypothesis with probability of 1.
  2. Researchers care more about large effects than about small ones. In particular, there is some size of effect that any given researcher will call ‘trivial’, below which that researcher is uninterested in the effect.
  3. If the null hypothesis is always false, and if some effects are not worth caring about in practical terms, then researchers who collect very large samples will invariably end up identifying many effects that are statistically significant but completely uninteresting.

I think it would be hard to dispute any of these claims. The first one is the source of persistent statistical criticism of the null hypothesis testing framework, and the second one is self-evidently true (if you doubt it, ask yourself whether you would really care to continue your research if you knew with 100% confidence that all of your effects would never be any larger than one one-thousandth of a standard deviation). The third one follows directly from the first two.

Where Friston’s commentary starts to depart from conventional wisdom is in the implications he thinks these premises have for the sample sizes researchers should use in neuroimaging studies. Specifically, he argues that since large samples will invariably end up identifying trivial effects, whereas small samples will generally only have power to detect large effects, it’s actually in neuroimaging researchers’ best interest not to collect a lot of data. In other words, Friston turns what most commentators have long considered a weakness of fMRI studies—their small sample size—into a virtue.

Here’s how he characterizes an imaginary reviewer’s misguided concern about low power:

Reviewer: Unfortunately, this paper cannot be accepted due to the small number of subjects. The significant results reported by the authors are unsafe because the small sample size renders their design insufficiently powered. It may be appropriate to reconsider this work if the authors recruit more subjects.

Friston suggests that the appropriate response from a clever author would be something like the following:

Response: We would like to thank the reviewer for his or her comments on sample size; however, his or her conclusions are statistically misplaced. This is because a significant result (properly controlled for false positives), based on a small sample indicates the treatment effect is actually larger than the equivalent result with a large sample. In short, not only is our result statistically valid. It is quantitatively more significant than the same result with a larger number of subjects.

This is supported by an extensive appendix (written non-ironically), where Friston presents a series of nice sensitivity and classification analyses intended to give the reader an intuitive sense of what different standardized effect sizes mean, and what the implications are for the detection of statistically significant effects using a classical inference (i.e., hypothesis testing) approach. The centerpiece of the appendix is a loss-function analysis where Friston pits the benefit of successfully detecting a large effect (which he defines as a Cohen’s d of 1, i.e., an effect of one standard deviation) against the cost of rejecting the null when the effect is actually trivial (defined as a d of 0.125 or less). Friston notes that the loss function is minimized (i.e., the difference between the hit rate for large effects and the miss rate for trivial effects is maximized) when n = 16, which is where the number he repeatedly quotes as a reasonable sample size for fMRI studies comes from. (Actually, as I discuss in my Appendix I below, I think Friston’s power calculations are off, and the right number, even given his assumptions, is more like 22. But the point is, it’s a small number either way.)

It’s important to note that Friston is not shy about asserting his conclusion that small samples are just fine for neuroimaging studies—especially in the Appendices, which are not intended to be ironic. He makes claims like the following:

The first appendix presents an analysis of effect size in classical inference that suggests the optimum sample size for a study is between 16 and 32 subjects. Crucially, this analysis suggests significant results from small samples should be taken more seriously than the equivalent results in oversized studies.

And:

In short, if we wanted to optimise the sensitivity to large effects but not expose ourselves to trivial effects, sixteen subjects would be the optimum number.

And:

In short, if you cannot demonstrate a significant effect with sixteen subjects, it is probably not worth demonstrating.

These are very strong claims delivered with minimal qualification, and given Friston’s influence, could potentially lead many reviewers to discount their own prior concerns about small sample size and low power—which would be disastrous for the field. So I think it’s important to explain exactly why Friston is wrong and why his recommendations regarding sample size shouldn’t be taken seriously.

What’s wrong with the argument

Broadly speaking, there are three problems with Friston’s argument. The first one is that Friston presents the absolute best-case scenario as if it were typical. Specifically, the recommendation that a sample of 16 – 32 subjects is generally adequate for fMRI studies assumes that fMRI researchers are conducting single-sample t-tests at an uncorrected threshold of p < .05; that they only care about effects on the order of 1 sd in size; and that any effect smaller than d = .125 is trivially small and is to be avoided. If all of this were true, an n of 16 (or rather, 22; see Appendix I below) might be reasonable. But it doesn’t really matter, because if you make even slightly less optimistic assumptions, you end up in a very different place. For example, for a two-sample t-test at p < .001 (a very common scenario in group difference studies), the optimal sample size, according to Friston’s own loss-function analysis, turns out to be 87 per group, or 174 subjects in total.

I discuss the problems with the loss-function analysis in much more detail in Appendix I below; the main point here is that even if you take Friston’s argument at face value, his own numbers put the lie to the notion that a sample size of 16 – 32 is sufficient for the majority of cases. It flatly isn’t. There’s nothing magic about 16, and it’s very bad advice to suggest that authors should routinely shoot for sample sizes this small when conducting their studies given that Friston’s own analysis would seem to demand a much larger sample size the vast majority of the time.
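To make the sensitivity to assumptions concrete, here is a rough sketch of the loss-function idea as I've described it: pick the n that maximizes power for a "large" effect (d = 1) minus power for a "trivial" one (d = 0.125). This is not Friston's code or mine from the appendix. It uses a normal approximation to the t-test, and the function names, the one-sided test in the one-sample case, and the two-sided test in the two-sample case are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def approx_power(d, n, alpha, two_sided, two_sample):
    """Normal-approximation power of a t-test for standardized effect d.

    n is the number of subjects (per group when two_sample is True).
    """
    # Noncentrality: d * sqrt(n) for one sample, d * sqrt(n / 2) for two groups.
    ncp = d * sqrt(n / 2 if two_sample else n)
    if two_sided:
        z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
        upper_tail = 1 - STD_NORMAL.cdf(z_crit - ncp)
        lower_tail = STD_NORMAL.cdf(-z_crit - ncp)
        return upper_tail + lower_tail
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_crit - ncp)

def best_n(alpha, two_sided, two_sample, large_d=1.0, trivial_d=0.125, max_n=500):
    """Return the n maximizing power(large_d) - power(trivial_d)."""
    def score(n):
        hit_large = approx_power(large_d, n, alpha, two_sided, two_sample)
        hit_trivial = approx_power(trivial_d, n, alpha, two_sided, two_sample)
        return hit_large - hit_trivial
    return max(range(2, max_n + 1), key=score)

# Best case: one-sample test, uncorrected alpha = .05 (one-sided, by assumption).
n_one_sample = best_n(alpha=0.05, two_sided=False, two_sample=False)
# Less rosy case: two-sample test at alpha = .001 (two-sided, by assumption).
n_two_sample = best_n(alpha=0.001, two_sided=True, two_sample=True)

print("one-sample, alpha = .05:", n_one_sample)
print("two-sample, alpha = .001 (per group):", n_two_sample)
```

Under these approximations the first optimum should land in the mid-teens and the second somewhere in the 80s per group. An exact t-based calculation will shift both numbers somewhat, but the gap between the two scenarios is the point.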

What about uncertainty?

The second problem is that Friston’s argument entirely ignores the role of uncertainty in drawing inferences about effect sizes. The notion that an effect that comes from a small study is likely to be bigger than one that comes from a larger study may be strictly true in the sense that, for any fixed p value, the observed effect size necessarily varies inversely with sample size. It’s true, but it’s also not very helpful. The reason it’s not helpful is that while the point estimate of statistically significant effects obtained from a small study will tend to be larger, the uncertainty around that estimate is also greater—and with sample sizes in the neighborhood of 16 – 20, will typically be so large as to be nearly worthless. For example, a correlation of r = .75 sounds huge, right? But when that correlation is detected at a threshold of p < .001 in a sample of 16 subjects, the corresponding 99.9% confidence interval is .06 – .95—a range so wide as to be almost completely uninformative.
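If you want to check that interval yourself, here is a minimal sketch using the standard Fisher z-transformation. The helper name `fisher_ci` is made up for this example, and the Fisher approach is an approximation, so expect small differences from other methods.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

def fisher_ci(r, n, confidence):
    """Approximate confidence interval for a Pearson correlation."""
    z = atanh(r)                      # Fisher z-transform of r
    se = 1 / sqrt(n - 3)              # standard error on the z scale
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Build the interval on the z scale, then map back to the r scale.
    return tanh(z - z_crit * se), tanh(z + z_crit * se)

lower, upper = fisher_ci(r=0.75, n=16, confidence=0.999)
print(f"99.9% CI for r = .75 with n = 16: {lower:.3f} to {upper:.3f}")
```

Rerunning this with a larger n (say, 100) is a quick way to see how much of the uninformativeness is down to sample size alone.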

Fortunately, what Friston argues small samples can do for us indirectly (namely, establish that effect sizes are big enough to care about) can be done much more directly, simply by looking at the uncertainty associated with our estimates. That’s exactly what confidence intervals are for. If our goal is to ensure that we only end up talking about results big enough to care about, it’s surely better to answer the question “how big is the effect?” by saying, “d = 1.1, with a 95% confidence interval of 0.2 – 2.1” than by saying “well it’s statistically significant at p < .001 in a sample of 16 subjects, so it’s probably pretty big”. In fact, if you take the latter approach, you’ll be wrong quite often, for the simple reason that p values will generally be closer to the statistical threshold with small samples than with big ones. Remember that, by definition, the point at which one is allowed to reject the null hypothesis is also the point at which the relevant confidence interval borders on zero. So it doesn’t really matter whether your sample is small or large; if you only just barely managed to reject the null hypothesis, you cannot possibly be in a good position to conclude that the effect is likely to be a big one.

As far as I can tell, Friston completely ignores the role of uncertainty in his commentary. For example, he gives the following example, which is supposed to convince you that you don’t really need large samples:

Imagine we compared the intelligence quotient (IQ) between the pupils of two schools. When comparing two groups of 800 pupils, we found mean IQs of 107.1 and 108.2, with a difference of 1.1. Given that the standard deviation of IQ is 15, this would be a trivial effect size … In short, although the differential IQ may be extremely significant, it is scientifically uninteresting … Now imagine that your research assistant had the bright idea of comparing the IQ of students who had and had not recently changed schools. On selecting 16 students who had changed schools within the past five years and 16 matched pupils who had not, she found an IQ difference of 11.6, where this medium effect size just reached significance. This example highlights the difference between an uninformed overpowered hypothesis test that gives very significant, but uninformative results and a more mechanistically grounded hypothesis that can only be significant with a meaningful effect size.

But the example highlights no such thing. One is not entitled to conclude, in the latter case, that the true effect must be medium-sized just because it came from a small sample. If the effect only just reached significance, the confidence interval by definition just barely excludes zero, and we can’t say anything meaningful about the size of the effect, but only about its sign (i.e., that it was in the expected direction)—which is (in most cases) not nearly as useful.
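Here is a back-of-the-envelope version of that point, using the numbers from Friston's small-sample IQ example. The big assumption (mine, not Friston's) is that the pooled sample standard deviation equals the population value of 15; the critical t value for 30 degrees of freedom is hardcoded from a standard table.

```python
from math import sqrt

n_per_group = 16
mean_difference = 11.6   # IQ points, from the quoted example
pooled_sd = 15.0         # assumption: sample SD equals the population SD of IQ
t_crit = 2.042           # two-sided 5% critical t for 30 df, from a standard table

# Standard error of a difference between two independent group means.
standard_error = pooled_sd * sqrt(2 / n_per_group)
t_stat = mean_difference / standard_error

ci_lower = mean_difference - t_crit * standard_error
ci_upper = mean_difference + t_crit * standard_error

print(f"t = {t_stat:.2f} (critical value {t_crit})")
print(f"95% CI for the difference: {ci_lower:.1f} to {ci_upper:.1f} IQ points")
print(f"in SD units: {ci_lower / pooled_sd:.2f} to {ci_upper / pooled_sd:.2f}")
```

Under that assumption the effect does just clear the threshold, and the interval runs from under one IQ point to over twenty. That range covers everything from "trivial" to "huge", which is exactly why the point estimate alone tells us so little.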

In fact, we will generally be in a much worse position with a small sample than a large one, because with a large sample, we at least stand a chance of being able to distinguish small effects from large ones. Recall that Friston advises against collecting very large samples for the very reason that they are likely to produce a wealth of statistically-significant-but-trivially-small effects. Well, maybe so, but so what? Why would it be a bad thing to detect trivial effects so long as we were also in an excellent position to know that those effects were trivial? Nothing about the hypothesis-testing framework commits us to treating all of our statistically significant results like they’re equally important. If we have a very large sample, and some of our effects have confidence intervals from 0.02 to 0.15 while others have CIs from 0.42 to 0.52, we would be wise to focus most of our attention on the latter rather than the former. At the very least this seems like a more reasonable approach than deliberately collecting samples so small that they will rarely be able to tell us anything meaningful about the size of our effects.

What about the prior?

The third, and arguably biggest, problem with Friston’s argument is that it completely ignores the prior—i.e., the expected distribution of effect sizes across the brain. Friston’s commentary assumes a uniform prior everywhere; for the analysis to go through, one has to believe that trivial effects and very large effects are equally likely to occur. But this is patently absurd; while that might be true in select situations, by and large, we should expect small effects to be much more common than large ones. In a previous commentary (on the Vul et al “voodoo correlations” paper), I discussed several reasons for this; rather than go into detail here, I’ll just summarize them:

  • It’s frankly just not plausible to suppose that effects are really as big as they would have to be in order to support adequately powered analyses with small samples. For example, a correlational analysis with 20 subjects at p < .001 would require a population effect size of r = .77 to have 80% power. If you think it’s plausible that focal activation in a single brain region can explain 60% of the variance in a complex trait like fluid intelligence or extraversion, I have some property under a bridge I’d like you to come by and look at.
  • The low-hanging fruit get picked off first. Back when fMRI was in its infancy in the mid-1990s, people could indeed publish findings based on samples of 4 or 5 subjects. I’m not knocking those studies; they taught us a huge amount about brain function. In fact, it’s precisely because they taught us so much about the brain that researchers can no longer stick 5 people in a scanner and report that doing a working memory task robustly activates the frontal cortex. Nowadays, identifying an interesting effect is more difficult—and if that effect were really enormous, odds are someone would have found it years ago. But this shouldn’t surprise us; neuroimaging is now a relatively mature discipline, and effects on the order of 1 sd or more are extremely rare in most mature fields (for a nice review, see Meyer et al (2001)).
  • fMRI studies with very large samples invariably seem to report much smaller effects than fMRI studies with small samples. This can only mean one of two things: (a) large studies are done much more poorly than small studies (implausible—if anything, the opposite should be true); or (b) the true effects are actually quite small in both small and large fMRI studies, but they’re inflated by selection bias in small studies, whereas large studies give an accurate estimate of their magnitude (very plausible).
  • Individual differences or between-group analyses, which have much less power than within-subject analyses, tend to report much more sparing activations. Again, this is consistent with the true population effects being on the small side.
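For the first point in the list above, the required population correlation can be approximated in a few lines. This sketch uses the Fisher z approximation rather than an exact power calculation, so it should land within a couple of hundredths of the r = .77 figure rather than matching it exactly; the function name `required_r` is hypothetical.

```python
from math import sqrt, tanh
from statistics import NormalDist

def required_r(n, alpha, power):
    """Approximate population r needed to reach the given power.

    Uses the Fisher z approximation for a two-sided test of r = 0.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_power = norm.inv_cdf(power)           # quantile for the target power
    # On the z scale the required effect is (z_alpha + z_power) standard errors.
    return tanh((z_alpha + z_power) / sqrt(n - 3))

r_needed = required_r(n=20, alpha=0.001, power=0.80)
print(f"required r: {r_needed:.2f}, variance explained: {r_needed ** 2:.0%}")
```

Plugging in larger samples shows how quickly the required effect drops to something a single brain region might plausibly deliver.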

To be clear, I’m not saying there are never any large effects in fMRI studies. Under the right circumstances, there certainly will be. What I’m saying is that, in the absence of very good reasons to suppose that a particular experimental manipulation is going to produce a large effect, our default assumption should be that the vast majority of (interesting) experimental contrasts are going to produce diffuse and relatively weak effects.

Note that Friston’s assertion that “if one finds a significant effect with a small sample size, it is likely to have been caused by a large effect size” depends entirely on the prior effect size distribution. If the brain maps we look at are actually dominated by truly small effects, then it’s simply not true that a statistically significant effect obtained from a small sample is likely to have been caused by a large effect size. We can see this easily by thinking of a situation in which an experiment has a weak but very diffuse effect on brain activity. Suppose that the entire brain showed ‘trivial’ effects of d = 0.125 in the population, and that there were actually no large effects at all. A one-sample t-test at p < .001 has less than 1% power to detect this effect, so you might suppose, as Friston does, that we could discount the possibility that a significant effect would have come from a trivial effect size. And yet, because a whole-brain analysis typically involves tens of thousands of tests, there’s a very good chance such an analysis will end up identifying statistically significant effects somewhere in the brain. Unfortunately, because the only way to identify a trivial effect with a small sample is to capitalize on chance (Friston discusses this point in his Appendix II, and additional treatments can be found in Ioannidis (2008), or in my 2009 commentary), that tiny effect won’t look tiny when we examine it; it will in all likelihood look enormous.

Since they say a picture is worth a thousand words, here’s one (from an unpublished paper in progress):

The top panel shows you a hypothetical distribution of effects (Pearson’s r) in a 2-dimensional ‘brain’ in the population. Note that there aren’t any astronomically strong effects (though the white circles indicate correlations of .5 or greater, which are certainly very large). The bottom panel shows what happens when you draw random samples of various sizes from the population and use different correction thresholds/approaches. You can see that the conclusion you’d draw if you followed Friston’s advice—i.e., that any effect you observe with n = 20 must be pretty robust to survive correction—is wrong; the isolated region that survives correction at FDR = .05, while ‘real’ in a trivial sense, is not in fact very strong in the true map—it just happens to be grossly inflated by sampling error. This is to be expected; when power is very low but the number of tests you’re performing is very large, the odds are good that you’ll end up identifying some real effect somewhere in the brain–and the estimated effect size within that region will be grossly distorted because of the selection process.
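The same selection effect is easy to reproduce with a toy simulation. To be clear, this is not the simulation behind the figure: it is a stripped-down sketch in which every one of 20,000 independent "voxels" has the same true effect of d = 0.125, each is tested with a one-sample t-test on n = 20 subjects, and the critical t value for 19 degrees of freedom is hardcoded from a standard table. All of those choices are assumptions made for illustration.

```python
import random
from math import sqrt

random.seed(0)

N_SUBJECTS = 20
N_VOXELS = 20_000
TRUE_D = 0.125
T_CRIT = 3.883  # two-sided p < .001 critical t for 19 df, from a standard table

significant_d = []
for _ in range(N_VOXELS):
    # Each subject's value is the true effect plus unit-variance noise.
    sample = [random.gauss(TRUE_D, 1.0) for _ in range(N_SUBJECTS)]
    mean = sum(sample) / N_SUBJECTS
    sd = sqrt(sum((x - mean) ** 2 for x in sample) / (N_SUBJECTS - 1))
    t_stat = mean / (sd / sqrt(N_SUBJECTS))
    if abs(t_stat) > T_CRIT:
        significant_d.append(mean / sd)  # observed Cohen's d at this voxel

print(len(significant_d), "of", N_VOXELS, "voxels pass threshold")
if significant_d:
    smallest = min(abs(d) for d in significant_d)
    print(f"true d = {TRUE_D}, smallest observed |d| among survivors = {smallest:.2f}")
```

Because t = d × √n in a one-sample test, any voxel that clears the threshold must show an observed |d| of at least 3.883 / √20, or roughly 0.87. Every survivor therefore looks about seven times larger than the true effect, no matter how trivial that true effect is.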

Encouraging people to use small samples is a sure way to ensure that researchers continue to publish highly biased findings that lead other researchers down garden paths trying unsuccessfully to replicate ‘huge’ effects. It may make for an interesting, more publishable story (who wouldn’t rather talk about the single cluster that supports human intelligence than about the complex, highly distributed pattern of relatively weak effects?), but it’s bad science. It’s exactly the same problem geneticists confronted ten or fifteen years ago when the first candidate gene and genome-wide association studies (GWAS) seemed to reveal remarkably strong effects of single genetic variants that subsequently failed to replicate. And it’s the same reason geneticists now run association studies with 10,000+ subjects and not 300.

Unfortunately, the costs of fMRI scanning haven’t come down the same way the costs of genotyping have, so there’s tremendous resistance at present to the idea that we really do need to routinely acquire much larger samples if we want to get a clear picture of how big effects really are. Be that as it may, we shouldn’t indulge in wishful thinking just because of logistical constraints. The fact that it’s difficult to get good estimates doesn’t mean we should pretend our bad estimates are actually good ones.

What’s right with the argument

Having criticized much of Friston’s commentary, I should note that there’s one part I like a lot, and that’s the section on protected inference in Appendix I. The point Friston makes here is that you can still use a standard hypothesis testing approach fruitfully—i.e., without falling prey to the problem of classical inference—so long as you explicitly protect against the possibility of identifying trivial effects. Friston’s treatment is mathematical, but all he’s really saying here is that it makes sense to use non-zero ranges instead of true null hypotheses. I’ve advocated the same approach before (e.g., here), as I’m sure many other people have. The point is simple: if you think an effect of, say, 1/8th of a standard deviation is too small to care about, then you should define a ‘pseudonull’ hypothesis of d = -.125 to .125 instead of a null of exactly zero.

Once you do that, any time you reject the null, you’re now entitled to conclude with reasonable certainty that your effects are in fact non-trivial in size. So I completely agree with Friston when he observes in the conclusion to Appendix I that:

…the adage ‘you can never have enough data’ is also true, provided one takes care to protect against inference on trivial effect sizes – for example using protected inference as described above.

Of course, the reason I agree with it is precisely because it directly contradicts Friston’s dominant recommendation to use small samples. In fact, since rejecting non-zero values is more difficult than rejecting a null of zero, when you actually perform power calculations based on protected inference, it becomes immediately apparent just how inadequate samples on the order of 16 – 32 subjects will be most of the time (e.g., rejecting a null of zero when detecting an effect of d = 0.5 with 80% power using a one-sample t-test at p < .05 requires 33 subjects, but if you want to reject a ‘trivial’ effect size range of |d| <= .125, that n is now upwards of 50).
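A quick way to see where numbers like these come from is the textbook normal-approximation formula for sample size. This is a sketch under stated assumptions, not the exact calculation: it ignores the t-distribution correction (which adds a subject or two), and it treats protected inference as a two-sided test against the edge of the trivial range. The function name `n_required` is invented for the example.

```python
from math import ceil
from statistics import NormalDist

def n_required(true_d, null_bound, alpha=0.05, power=0.80):
    """Approximate one-sample n needed to reject effects up to null_bound.

    null_bound = 0 is the usual point null; null_bound = 0.125 rejects the
    whole 'trivial' range. Normal approximation, two-sided alpha.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    # Only the distance between the true effect and the null bound matters.
    return ceil(((z_alpha + z_power) / (true_d - null_bound)) ** 2)

n_point_null = n_required(true_d=0.5, null_bound=0.0)
n_protected = n_required(true_d=0.5, null_bound=0.125)
print("point null of zero:", n_point_null)
print("protected against |d| <= .125:", n_protected)
```

The approximation should come out a little below the exact t-based figure for the point null and in the mid-50s for the protected case. Shrinking the distance between the true effect and the edge of the trivial range is what drives the required n up.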

Reprising the rules

With the above considerations in mind, we can now turn back to Friston’s rules 4, 5, and 8, and see why his admonitions to reviewers are uncharitable at best and insensible at worst. First, Rule 4 (the under-sampled study). Here’s the kind of comment Friston (ironically) argues reviewers should avoid:

 Reviewer: Unfortunately, this paper cannot be accepted due to the small number of subjects. The significant results reported by the authors are unsafe because the small sample size renders their design insufficiently powered. It may be appropriate to reconsider this work if the authors recruit more subjects.

Perhaps many reviewers make exactly this argument; I haven’t been an editor, so I don’t know (though I can say that I’ve read many reviews of papers I’ve co-reviewed and have never actually seen this particular variant). But even if we give Friston the benefit of the doubt and accept that one shouldn’t question the validity of a finding on the basis of small samples (i.e., we accept that p values mean the same thing in large and small samples), that doesn’t mean the more general critique from low power is itself a bad one. To the contrary, a much better form of the same criticism–and one that I’ve raised frequently myself in my own reviews–is the following:

 Reviewer: the authors draw some very strong conclusions in their Discussion about the implications of their main finding. But their finding issues from a sample of only 16 subjects, and the confidence interval around the effect is consequently very large, and nearly includes zero. In other words, the authors’ findings are entirely consistent with the effect they report actually being very small–quite possibly too small to care about. The authors should either weaken their assertions considerably, or provide additional evidence for the importance of the effect.

Or another closely related one, which I’ve also raised frequently:

 Reviewer: the authors tout their results as evidence that region R is ‘selectively’ activated by task T. However, this claim is based entirely on the fact that region R was the only part of the brain to survive correction for multiple comparisons. Given that the sample size in question is very small, and power to detect all but the very largest effects is consequently very low, the authors are in no position to conclude that the absence of significant effects elsewhere in the brain suggests selectivity in region R. With this small a sample, the authors’ data are entirely consistent with the possibility that many other brain regions are just as strongly activated by task T, but failed to attain significance due to sampling error. The authors should either avoid making any claim that the activity they observed is selective, or provide direct statistical support for their assertion of selectivity.

Neither of these criticisms can be defused by suggesting that effect sizes from smaller samples are likely to be larger than effect sizes from large studies. And it would be disastrous for the field of neuroimaging if Friston’s commentary succeeded in convincing reviewers to stop criticizing studies on the basis of low power. If anything, we collectively need to focus far greater attention on issues surrounding statistical power.

Next, Rule 5 (the over-sampled study):

Reviewer: I would like to commend the authors for studying such a large number of subjects; however, I suspect they have not heard of the fallacy of classical inference. Put simply, when a study is overpowered (with too many subjects), even the smallest treatment effect will appear significant. In this case, although I am sure the population effects reported by the authors are significant; they are probably trivial in quantitative terms. It would have been much more compelling had the authors been able to show a significant effect without resorting to large sample sizes. However, this was not the case and I cannot recommend publication.

I’ve already addressed this above; the problem with this line of reasoning is that nothing says you have to care equally about every statistically significant effect you detect. If you ever run into a reviewer who insists that your sample is overpowered and has consequently produced too many statistically significant effects, you can simply respond like this:

 Response: we appreciate the reviewer’s concern that our sample is potentially overpowered. However, this strikes us as a limitation of classical inference rather than a problem with our study. To the contrary, the benefit of having a large sample is that we are able to focus on effect sizes rather than on rejecting a null hypothesis that we would argue is meaningless to begin with. To this end, we now display a second, more conservative, brain activation map alongside our original one that raises the statistical threshold to the point where the confidence intervals around all surviving voxels exclude effects smaller than d = .125. The reviewer can now rest assured that our results protect against trivial effects. We would also note that this stronger inference would not have been possible if our study had had a much smaller sample.

There is rarely if ever a good reason to criticize authors for having a large sample after it’s already collected. You can always raise the statistical threshold to protect against trivial effects if you need to; what you can’t easily do is magic more data into existence in order to shrink your confidence intervals.
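For a sense of what “raising the threshold” means in practice, here is a minimal sketch under some loud assumptions: a one-sample design, a normal approximation, an uncorrected alpha, and no allowance for uncertainty in the standard deviation. The name `protected_threshold` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def protected_threshold(n, d_trivial=0.125, alpha=0.05):
    """z-threshold at which the CI lower bound on d clears d_trivial.

    Normal approximation for a one-sample design: d_hat = z / sqrt(n), with
    a CI half-width of z_crit / sqrt(n). So the lower bound exceeds
    d_trivial exactly when z > z_crit + d_trivial * sqrt(n).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z_crit + d_trivial * sqrt(n)


thresholds = {n: round(protected_threshold(n), 2) for n in (16, 100, 400)}
print(thresholds)
```

The threshold climbs with n, but the smallest observed effect that survives it (threshold divided by the square root of n) shrinks toward .125. That is the sense in which a big sample buys you the stronger inference.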

Lastly, Rule 8 (exploiting ‘superstitious’ thinking about effect sizes):

 Reviewer: It appears that the authors are unaware of the dangers of voodoo correlations and double dipping. For example, they report effect sizes based upon data (regions of interest) previously identified as significant in their whole brain analysis. This is not valid and represents a pernicious form of double dipping (biased sampling or non-independence problem). I would urge the authors to read Vul et al. (2009) and Kriegeskorte et al. (2009) and present unbiased estimates of their effect size using independent data or some form of cross validation.

Friston’s recommended response is to point out that concerns about double-dipping are misplaced, because the authors are typically not making any claims that the reported effect size is an accurate representation of the population value, but only following standard best-practice guidelines to include effect size measures alongside p values. This would be a fair recommendation if it were true that reviewers frequently object to the mere act of reporting effect sizes based on the specter of double-dipping; but I simply don’t think this is an accurate characterization. In my experience, the impetus for bringing up double-dipping is almost always one of two things: (a) authors getting overly excited about the magnitude of the effects they have obtained, or (b) authors conducting non-independent tests and treating them as though they were independent (e.g., when identifying an ROI based on a comparison of conditions A and B, and then reporting a comparison of A and C without considering the bias inherent in this second test). Both of these concerns are valid and important, and it’s a very good thing that reviewers bring them up.

The right way to determine sample size

If we can’t rely on blanket recommendations to guide our choice of sample size, then what? Simple: perform a power calculation. There’s no mystery to this; both brief and extended treatises on statistical power are all over the place, and power calculators for most standard statistical tests are available online as well as in most off-line statistical packages (e.g., I use the pwr package for R). For more complicated statistical tests for which analytical solutions aren’t readily available (e.g., fancy interactions involving multiple within- and between-subject variables), you can get reasonably good power estimates through simulation.
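Here is roughly what the simulation route looks like, sketched in Python with only the standard library. It uses a one-sample t-test so the answer can be checked against an analytic calculator; for a fancier design you would swap in your own data generator and test statistic. The names `t_stat` and `simulate_power` are mine, and getting the critical value from a simulated null is one choice among several.

```python
import random


def t_stat(sample):
    """One-sample t statistic against a mean of zero."""
    n = len(sample)
    m = sum(sample) / n
    var = sum((x - m) ** 2 for x in sample) / (n - 1)
    return m / (var / n) ** 0.5


def simulate_power(n, d, alpha=0.05, n_sims=5000, seed=1):
    """Estimate two-sided power by brute force for a true effect of d sd."""
    rng = random.Random(seed)

    def draw(mu):
        return [rng.gauss(mu, 1.0) for _ in range(n)]

    # Empirical critical value: the (1 - alpha) quantile of |t| under the null.
    null_ts = sorted(abs(t_stat(draw(0.0))) for _ in range(n_sims))
    critical = null_ts[int((1 - alpha) * n_sims)]
    # Power: share of simulated studies with a true effect that exceed it.
    hits = sum(abs(t_stat(draw(d))) > critical for _ in range(n_sims))
    return hits / n_sims


estimate = simulate_power(n=16, d=0.5)
print(estimate)
```

The estimate carries Monte Carlo error in both the critical value and the hit rate, so bump `n_sims` up if you need more than a couple of decimal places, especially at stricter thresholds.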

Of course, there’s no guarantee you’ll like the answers you get. Actually, in most cases, if you’re honest about the numbers you plug in, you probably won’t like the answer you get. But that’s life; nature doesn’t care about making things convenient for us. If it turns out that it takes 80 subjects to have adequate power to detect the effects we care about and expect, we can (a) suck it up and go for n = 80, (b) decide not to run the study, or (c) accept that logistical constraints mean our study will have less power than we’d like (which implies that any results we obtain will offer only a fractional view of what’s really going on). What we don’t get to do is look the other way and pretend that it’s just fine to go with 16 subjects simply because the last time we did that, we got this amazingly strong, highly selective activation that successfully made it into a good journal. That’s the same logic that repeatedly produced unreplicable candidate gene findings in the 1990s, and, if it continues to go unchecked in fMRI research, risks turning the field into a laughing stock among other scientific disciplines.

Conclusion

The point of all this is not to convince you that it’s impossible to do good fMRI research with just 16 subjects, or that reviewers don’t sometimes say silly things. There are many questions that can be answered with 16 or even fewer subjects, and reviewers most certainly do say silly things (I sometimes cringe when re-reading my own older reviews). The point is that blanket pronouncements, particularly when made ironically and with minimal qualification, are not helpful in advancing the field, and can be very damaging. It simply isn’t true that there’s some magic sample size range like 16 to 32 that researchers can bank on reflexively. If there’s any generalization that we can allow ourselves, it’s probably that, under reasonable assumptions, Friston’s recommended sample sizes are much too small. Typical effect sizes and analysis procedures will generally require much larger samples than neuroimaging researchers are used to collecting. But again, there’s no substitute for careful case-by-case consideration.

In the natural course of things, there will be cases where n = 4 is enough to detect an effect, and others where the effort is questionable even with 100 subjects; unfortunately, we won’t know which situation we’re in unless we take the time to think carefully and dispassionately about what we’re doing. It would be nice to believe otherwise; certainly, it would make life easier for the neuroimaging community in the short term. But since the point of doing science is to discover what’s true about the world, and not to publish an endless series of findings that sound exciting but don’t replicate, I think we have an obligation to both ourselves and to the taxpayers that fund our research to take the exercise more seriously.

 

 

Appendix I: Evaluating Friston’s loss-function analysis

In this appendix I review a number of weaknesses in Friston’s loss-function analysis, and show that under realistic assumptions, the recommendation to use sample sizes of 16 – 32 subjects is far too optimistic.

First, the numbers don’t seem to be right. I say this with a good deal of hesitation, because I have very poor mathematical skills, and I’m sure Friston is much smarter than I am. That said, I’ve tried several different power packages in R and finally resorted to empirically estimating power with simulated draws, and all approaches converge on numbers quite different from Friston’s. Even the sensitivity plots seem off by a good deal (for instance, Friston’s Figure 3 suggests around 30% sensitivity with n = 80 and d = 0.125, whereas all the sources I’ve consulted produce a value around 20%). In my analysis, the loss function is minimized at n = 22 rather than n = 16. I suspect the problem is with Friston’s approximation, but I’m open to the possibility that I’ve done something very wrong, and confirmations or disconfirmations are welcome in the comments below. In what follows, I’ll report the numbers I get rather than Friston’s (mine are somewhat more pessimistic, but the overarching point doesn’t change either way).

Second, there’s the statistical threshold. Friston’s analysis assumes that all of our tests are conducted without correction for multiple comparisons (i.e., at p < .05), but this clearly doesn’t apply to the vast majority of neuroimaging studies, which are either conducting massive univariate (whole-brain) analyses, or testing at least a few different ROIs or networks. As soon as you lower the threshold, the optimal sample size returned by the loss-function analysis increases dramatically. If the threshold is a still-relatively-liberal (for whole-brain analysis) p < .001, the loss function is now minimized at 48 subjects–hardly a welcome conclusion, and a far cry from 16 subjects. Since this is probably still the modal fMRI threshold, one could argue Friston should have been trumpeting a sample size of 48 all along–not exactly a ‘small’ sample size given the associated costs.

Third, the n = 16 (or 22) figure only holds for the simplest of within-subject tests (e.g., a one-sample t-test)–again, a best-case scenario (though certainly a common one). It doesn’t apply to many other kinds of tests that are the primary focus of a huge proportion of neuroimaging studies–for instance, two-sample t-tests, or interactions between multiple within-subject factors. In fact, if you apply the same analysis to a two-sample t-test (or equivalently, a correlation test), the optimal sample size turns out to be 82 (41 per group) at a threshold of p < .05, and a whopping 174 (87 per group) at a threshold of p < .001. In other words, if we were to follow Friston’s own guidelines, the typical fMRI researcher who aims to conduct a (liberal) whole-brain individual differences analysis should be collecting 174 subjects a pop. For other kinds of tests (e.g., 3-way interactions), even larger samples might be required.

Fourth, the claim that only large effects–i.e., those that can be readily detected with a sample size of 16–are worth worrying about is likely to annoy and perhaps offend any number of researchers who have perfectly good reasons for caring about effects much smaller than half a standard deviation. A cursory look at most literatures suggests that effects of 1 sd are not the norm; they’re actually highly unusual in mature fields. For perspective, the standardized difference in height between genders is about 1.5 sd; the validity of job interviews for predicting success is about .4 sd; and the effect of gender on risk-taking (men take more risks) is about .2 sd—what Friston would call a very small effect (for other examples, see Meyer et al., 2001). Against this backdrop, suggesting that only effects greater than 1 sd (about the strength of the association between height and weight in adults) are of interest would seem to preclude many, and perhaps most, questions that researchers currently use fMRI to address. Imaging genetics studies are immediately out of the picture; so too, in all likelihood, are cognitive training studies, most investigations of individual differences, and pretty much any experimental contrast that claims to very carefully isolate a relatively subtle cognitive difference. Put simply, if the field were to take Friston’s analysis seriously, the majority of its practitioners would have to pack up their bags and go home. Entire domains of inquiry would shutter overnight.

To be fair, Friston briefly considers the possibility that small effect sizes could be important. But he doesn’t seem to take it very seriously:

Can true but trivial effect sizes can ever be interesting? It could be that a very small effect size may have important implications for understanding the mechanisms behind a treatment effect and that one should maximise sensitivity by using large numbers of subjects. The argument against this is that reporting a significant but trivial effect size is equivalent to saying that one can be fairly confident the treatment effect exists but its contribution to the outcome measure is trivial in relation to other unknown effects…

The problem with the latter argument is that the real world is a complicated place, and most interesting phenomena have many causes. A priori, it is reasonable to expect that the vast majority of effects will be small. We probably shouldn’t expect any single genetic variant to account for more than a small fraction of the variation in brain activity, but that doesn’t mean we should give up entirely on imaging genetics. And of course, it’s worth remembering that, in the context of fMRI studies, when Friston talks about ‘very small effect sizes,’ that’s a bit misleading; even medium-sized effects that Friston presumably allows are interesting could be almost impossible to detect at the sample sizes he recommends. For example, a one-sample t-test with n = 16 subjects detects an effect of d = 0.5 only 46% or 5% of the time at p < .05 and p < .001, respectively. Applying Friston’s own loss function analysis to detection of d = 0.5 returns an optimal sample size of n = 63 at p < .05 and n = 139 at p < .001—a message not entirely consistent with the recommendations elsewhere in his commentary.
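If you want to check the n = 16 power figures without R, here is a standard-library Python sketch that computes them by numerical integration rather than simulation. It conditions on the sample variance (a scaled chi-square) and integrates the normal tail areas over it. All function names are my own, it assumes n of at least 4, and it is a sanity check rather than a replacement for a proper power package. If I’ve done this right, the two printed values should come out close to the 46% and 5% quoted above.

```python
from math import exp, lgamma, log, pi, sqrt
from statistics import NormalDist


def simpson(f, a, b, steps=1000):
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def t_pdf(t, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1 + t * t / df))


def t_critical(df, alpha):
    """Two-sided critical value, found by bisection on the integrated pdf."""
    target = 0.5 - alpha / 2  # area between 0 and the critical value
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if simpson(lambda t: t_pdf(t, df), 0.0, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def chi2_pdf(v, df):
    """Chi-square density; returning 0 at v = 0 is only right for df > 2."""
    if v <= 0:
        return 0.0
    return exp((df / 2 - 1) * log(v) - v / 2 - (df / 2) * log(2) - lgamma(df / 2))


def power_one_sample_t(n, d, alpha=0.05):
    """Two-sided power of a one-sample t-test for a true effect of d sd."""
    df = n - 1
    crit = t_critical(df, alpha)
    delta = d * sqrt(n)  # noncentrality parameter
    z = NormalDist()

    def integrand(v):
        s = sqrt(v / df)  # sample sd relative to the true sd
        upper = 1 - z.cdf(crit * s - delta)  # P(t > crit | s)
        lower = z.cdf(-crit * s - delta)     # P(t < -crit | s)
        return (upper + lower) * chi2_pdf(v, df)

    v_max = df + 12 * sqrt(2 * df)  # chi-square mass beyond this is negligible
    return simpson(integrand, 0.0, v_max, steps=2000)


power_05 = power_one_sample_t(16, 0.5, alpha=0.05)
power_001 = power_one_sample_t(16, 0.5, alpha=0.001)
print(round(power_05, 2), round(power_001, 2))
```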

Friston, K. (2012). Ten ironic rules for non-statistical reviewers. NeuroImage. DOI: 10.1016/j.neuroimage.2012.04.018

in which I apologize for my laziness, but not really

April 11th, 2012 by Tal Yarkoni

I got back from the Cognitive Neuroscience Society meeting last week. I was planning to write a post-CNS wrap-up thing like I did last year and the year before that, but I seem to have misplaced the energy that’s supposed to fuel such an exercise. So instead I’ll just say I had a great time and leave it at that. What happens in Chicago stays in Chicago, etc. etc.

Also, I really appreciate all the people who came up to me at CNS and said nice things about this blog–it’s nice to know that someone actually reads this (puzzling, mind you, because I’m not sure why anyone reads this, but nice nonetheless). A couple of people encouraged me to blog more often, so I’m making an effort to do that, though the most likely outcome will be miserable failure. Either that or I’ll just start pasting random YouTube videos in this space. Like this one:

p.s. on re-reading that, it kind of makes it sound like I was swarmed by adoring fans at CNS. To clarify: “all the people” means, like, four people, and the “nice things” were really more like lukewarm “oh yeah, your blog’s not totally awful” sentiments.

p.p.s. I’ve noticed that a lot of my shorter posts take the form of “I was going to write about X, but I’m not actually going to write about X.” I think this is because I’m very lazy but still want partial credit for having good intentions. Which is kind of ridiculous.

on writing: some anecdotal observations, in no particular order

April 11th, 2012 by Tal Yarkoni
  • Early on in graduate school, I invested in the book “How to Write a Lot”. I enjoyed reading it–mostly because I (mistakenly) enjoyed thinking to myself, “hey, I bet as soon as I finish this book, I’m going to start being super productive!” But I can save you the $9 and tell you there’s really only one take-home point: schedule writing like any other activity, and stick to your schedule no matter what. Though, having said that, I don’t really do that myself. I find I tend to write about 20 hours a week on average. On a very good day, I manage to get a couple of thousand words written, but much more often, I get 200 words written that I then proceed to rewrite furiously and finally trash in frustration. But it all adds up in the long run I guess.
  • Some people are good at writing one thing at a time; they can sit down for a week and crank out a solid draft of a paper without ever looking sideways at another project. Personally, unless I have a looming deadline (and I mean a real deadline–more on that below), I find that impossible to do; my general tendency is to work on one writing project for an hour or two, and then switch to something else. Otherwise I pretty much lose my mind. I also find it helps to reward myself–i.e., I’ll work on something I really don’t want to do for an hour, and then play video games for a while switch to writing something more pleasant.
  • I can rarely get any ‘real’ writing (i.e., stuff that leads to publications) done after around 6 pm; late mornings (i.e., right after I wake up) are usually my most productive writing time. And I generally only write for fun (blogging, writing fiction, etc.) after 9 pm. There are exceptions, but by and large that’s my system.
  • I don’t write many drafts. I don’t mean that I never revise papers, because I do–obsessively. But I don’t sit down thinking “I’m going to write a very rough draft, and then I’ll go back and clean up the language.” I sit down thinking “I’m going to write a perfect paper the first time around,” and then I very slowly crank out a draft that’s remarkably far from being perfect. I suspect the former approach is actually the more efficient one, but I can’t bring myself to do it. I hate seeing malformed sentences on the page, even if I know I’m only going to delete them later. It always amazes and impresses me when I get Word documents from collaborators with titles like “AmazingNatureSubmissionVersion18”. I just give all my documents the title “paper_draft”. There might be a V2 or a V3, but there will never, ever be a V18.
  • Papers are not meant to be written linearly. I don’t know anyone who starts with the Introduction, then does the Methods and Results, and then finishes with the Discussion. Personally I don’t even write papers one section at a time. I usually start out by frantically writing down ideas as they pop into my head, and jumping around the document as I think of other things I want to say. I frequently write half a sentence down and then finish it with a bunch of question marks (like so: ???) to indicate I need to come back later and patch it up. Incidentally, this is also why I’m terrified to ever show anyone any of my unfinished paper drafts: an unsuspecting reader would surely come away thinking I suffer from a serious thought disorder. (I suppose they might be right.)
  • Okay, that last point is not entirely true. I don’t write papers completely haphazardly; I do tend to write Methods and Results before Intro and Discussion. I gather that this is a pretty common approach. On the rare occasions when I’ve started writing the Introduction first, I’ve invariably ended up having to completely rewrite it, because it usually turns out the results aren’t actually what I thought they were.
  • My sense is that most academics get more comfortable writing as time goes on. Relatively few grad students have the perseverance to rapidly crank out publication-worthy papers from day 1 (I was definitely not one of them). I don’t think this is just a matter of practice; I suspect part of it is a natural maturation process. People generally get more conscientious as they age; it stands to reason that writing (as an activity most people find unpleasant) should get easier too. I’m better at motivating myself to write papers now, but I’m also much better about doing the dishes and laundry–and I’m pretty sure that’s not because practice makes dishwashing perfect.
  • When I started grad school, I was pretty sure I’d never publish anything, let alone graduate, because I’d never handed in a paper as an undergraduate that wasn’t written at the last minute, whereas in academia, there are virtually no hard deadlines (see below). I’m not sure exactly what changed. I’m still continually surprised every time something I wrote gets published. And I often catch myself telling myself, “hey, self, how the hell did you ever manage to pay attention long enough to write 5,000 words?” And then I reply to myself, “well, self, since you ask, I took a lot of stimulants.”
  • I pace around a lot when I write. A lot. To the point where my labmates–who are all uncommonly nice people–start shooting death glares my way. It’s a heritable tendency, I guess (the pacing, not the death glare attraction); my father also used to pace obsessively. I’m not sure what the biological explanation for it is. My best guess is it’s an arousal-mediated effect: I can think pretty well when I’m around other people, or when I’m in motion, but if I’m sitting at a desk and I don’t already know exactly what I want to say, I can’t get anything done. I generally pace around the lab or house for a while figuring out what I want to say, and then I sit down and write until I’ve forgotten what I want to say, or decide I didn’t really want to say that after all. In practice this usually works out to 10 minutes of pacing for every 5 minutes of writing. I envy people who can just sit down and calmly write for two or three hours without interruption (though I don’t think there are that many of them). At the same time, I’m pretty sure I burn a lot of calories this way.
  • I’ve been pleasantly surprised to discover that I much prefer writing grant proposals to writing papers–to the point where I actually enjoy writing grant proposals. I suspect the main reason for this is that grant proposals have a kind of openness that papers don’t; with a paper, you’re constrained to telling the story the data actually support, whereas a grant proposal is as good as your vision of what’s possible (okay, and plausible). A second part of it is probably the novelty of discovery: once you conduct your analyses, all that’s left is to tell other people what you found, which (to me) isn’t so exciting. I mean, I already think I know what’s going on; what do I care if you know? Whereas when writing a grant, a big part of the appeal for me is that I could actually go out and discover new stuff–just as long as I can convince someone to give me some money first.
  • At a departmental seminar attended by about 30 people, I once heard a student express concern about an in-progress review article that he and several of the other people at the seminar were collaboratively working on. The concern was that if all of the collaborators couldn’t agree on what was going to go in the paper (and they didn’t seem to be able to at that point), the paper wouldn’t get written in time to make the rapidly approaching deadline dictated by the journal editor. A senior and very brilliant professor responded to the student’s concern by pointing out that this couldn’t possibly be a real problem seeing as in reality there is actually no such thing as a hard writing deadline. This observation didn’t go over so well with some of the other senior professors, who weren’t thrilled that their students were being handed the key to the kingdom of academic procrastination so early in their careers. But it was true, of course: with the major exception of grant proposals (EDIT: and as Garrett points out in the comments below, conference publications in disciplines like Computer Science), most of the things academics write (journal articles, reviews, commentaries, book chapters, etc.) operate on a very flexible schedule. Usually when someone asks you to write something for them, there is some vague mention somewhere of some theoretical deadline, which is typically a date that seems so amazingly far off into the future that you wonder if you’ll even be the same person when it rolls around. And then, much to your surprise, the deadline rolls around and you realize that you must in fact really be a different person, because you don’t seem to have any real desire to work on this thing you signed up for, and instead of writing it, why don’t you just ask the editor for an extension while you go rustle up some motivation. So you send a polite email, and the editor grudgingly says, “well, hmm, okay, you can have another two weeks,” to which you smile and nod sagely, and then, two weeks later, you send another similarly worded but even more obsequious email that starts with the words “so, about that extension…”

    The basic point here is that there’s an interesting dilemma: even though there rarely are any strict writing deadlines, it’s to almost everyone’s benefit to pretend they exist. If I ever find out that the true deadline (insofar as such a thing exists) for the chapter I’m working on right now is 6 months from now and not 3 months ago (which is what they told me), I’ll probably relax and stop working on it for, say, the next 5 and a half months. I sometimes think that the most productive academics are the ones who are just really really good at repeatedly lying to themselves.

  • I’m a big believer in structured procrastination when it comes to writing. I try to always have a really unpleasant but not-so-important task in the background, which then forces me to work on only-slightly-unpleasant-but-often-more-important tasks. Except it often turns out that the unpleasant-but-not-so-important task is actually an unpleasant-but-really-important task after all, and then I wake up in a cold sweat in the middle of the night thinking of all the ways I’ve screwed myself over. No, just kidding. I just bitch about it to my wife for a while and then drown my sorrows in an extra helping of ice cream.
  • I’m really, really, bad at restarting projects I’ve put on the back burner for a while. Right now there are 3 or 4 papers I’ve been working on on-and-off for 3 or 4 years, and every time I pick them up, I write a couple of hundred words and then put them away for a couple of months. I guess what I’m saying is that if you ever have the misfortune of collaborating on a paper with me, you should make sure to nag me several times a week until I get so fed up with you I sit down and write the damn paper. Otherwise it may never see the light of day.
  • I like writing fiction in my spare time. I also occasionally write whiny songs. I’m pretty terrible at both of these things, but I enjoy them, and I’m told (though I don’t believe it for a second) that that’s the important thing.

bio-, chemo-, neuro-, eco-informatics… why no psycho-?

March 2nd, 2012 by Tal Yarkoni

The latest issue of the APS Observer features a special section on methods. I contributed a piece discussing the need for a full-fledged discipline of psychoinformatics:

Scientific progress depends on our ability to harness and apply modern information technology. Many advances in the biological and social sciences now emerge directly from advances in the large-scale acquisition, management, and synthesis of scientific data. The application of information technology to science isn’t just a happy accident; it’s also a field in its own right — one commonly referred to as informatics. Prefix that term with a Greek root or two and you get other terms like bioinformatics, neuroinformatics, and ecoinformatics — all well-established fields responsible for many of the most exciting recent discoveries in their parent disciplines.

Curiously, following the same convention also gives us a field called psychoinformatics — which, if you believe Google, doesn’t exist at all (a search for the term returns only 500 hits as of this writing; Figure 1). The discrepancy is surprising, because labels aside, it’s clear that psychological scientists are already harnessing information technology in powerful and creative ways — often reshaping the very way we collect, organize, and synthesize our data.

Here’s the picture that’s worth, oh, at least ten or fifteen words:

Figure 1. Number of Google search hits for informatics-related terms, by prefix.

You can read the rest of the piece here if you’re so inclined. Check out some of the other articles too; I particularly like Denny Borsboom’s piece on network analysis. EDIT: and Anna Mikulak’s piece on optogenetics! I forgot the piece on optogenetics! How can you not love optogenetics!

deconstructing the turducken

February 18th, 2012 by Tal Yarkoni

This is fiction. Which means it’s entirely made up, and definitely not at all based on any real people or events.

Cornelius Kipling came over to our house for Thanksgiving. I didn’t invite him; I would never, ever invite him. He was guaranteed to show up slightly drunk and very belligerent, carrying a two-thirds empty bottle of cheap wine, which he’d then hand to us as if it had arrived unopened from some fancy French cellar.

Cornelius Kipling was never invited; he invited himself.

“Good to see you,” he said to me when we let him in. “Thanks for inviting me over. It’s very kind of you, seeing as how my other plans fell through at the last minute.”

“Hi Kip,” I said, knowing full well he’d never had any other plans.

“Ella,” Kip nodded in my wife’s general direction, taking care not to make direct eye contact. He’d learned from extended experience that once he made eye contact with people, it became much harder to ignore social cues.

“Cornelius,” she said, through a mouth as thin as a zipper.

“Just Kip is fine,” said Kip.

“Cornelius,” my wife repeated, louder this time.

“What are we having for dinner,” Kip asked, handing me a two-thirds empty bottle of Zinfandel.

“Well,” said Ella, “I was going to make a turducken. But now that you’re here, I figure I should make something special. So we’re having frozen chicken nuggets and mashed potatoes.”

“We spare no expense!” I added cheerfully.

“Funny you should mention turducken,” Kip said, ignoring our jabs. “My new business plan is based on the turducken.”

“Oh really,” I said. “Do pray tell.”

I wasn’t surprised Kip had a new business plan. If anything, I was surprised he’d managed to get as far as exchanging pleasantries before launching into a graphic description of his latest scheme.

“Well,” he said, “it’s not really based on the turducken. The turducken is more of an analogy. To illustrate what it is that my new startup does.”

“And what is it that your new startup does,” Ella’s mouth asked, though the rest of her face very clearly did not care to hear the answer.

“We miniaturize data,” Kip said. He waved his hands in the air with a flourish and looked at us expectantly. It made me think back to something my wife had said about Kip after the first time she ever met him: He thinks he’s a magician, and he acts like he’s a magician, but none of his tricks ever work.

“Prithee, do continue,” I said.

“We take big datasets,” he said. “Large datasets. Enormous datasets. Doesn’t matter what kind of data. You give it to us, and we miniaturize it. We give you back a much smaller dataset. And then you carry on your work with your wonderfully shrunken new spreadsheet, which keeps only the important trends and throws out all of the unnecessary details.”

“Interesting,” I nodded. On a scale of one-to-Kipsanity, this one was a solid five. “And the turducken figures into this how?”

“Weeeeeell, imagine someone hands you a turducken and asks you to figure out what’s in it,” said Kip. “I grant that this may not happen to you very often, but it happens all the time in KipLand. So, you know there’s a bunch of birds in there, all stuffed into each other’s–well, you know–but you don’t know which birds. All you see is this giant deep-fried bird collage, and you want to disassemble it into a set of discrete, identifiable fowls. Now, you hear a lot about how to construct a turducken. But if you think about it, deconstructing a turducken is a much more interesting engineering problem. And that’s what my new venture is all about. We take a complicated mass of data and pick out all the key elements that went into it. Deconstructing the turducken.”

He did the little flourish with his hands again. Again, Ella’s words rang out in my head. None of his tricks ever work.

“That’s quite possibly the craziest thing I’ve ever heard,” I observed. “This whole turducken analogy isn’t working so well for me. I hope you haven’t put it in your promotional materials.”

Kip stared at me unpleasantly for a good ten or fifteen seconds.

“Actually, I take that back,” I said. “That conversation we had about the shinbones on Isaac Newton’s coat of arms that time I ran into you at the dry cleaner’s… that was an order of magnitude more ridiculous.”

Maybe it was a mean thing to say, but you have to understand: my friendship with Kip is built entirely on mutual abuse. And he who flinches first, loses.

“Whatever,” Kip said. He looked annoyed, which filled me with schadenfreude. It wasn’t often he got to experience the full range of emotions he routinely visited on others.

“I didn’t come here to talk about turducken,” he continued. “You brought up the turducken, not me. I just wanted to get your opinion on something…”

Again the hand flourish. Again the voice.

“I’m trying to figure out what to call my new startup,” he said. “Which do you like better: ‘Small data’ or ‘little data’? Neither has the ring of ‘big data’, but I think both sound better than ‘Kipling Data Miniaturization Services’.”

“How about MiniData,” Ella offered. I noticed she was hitting the wine pretty hard, though we both knew it would do nothing to blunt the Kipling trauma.

“Or maybe NanoData,” I offered. “If you can make the data small enough. What level of compression are you aiming for?”

“Oh, sky’s the limit. Actually, that’s one of the unique features of my service. Most compression schemes have a fixed limit. Take a standard algorithm like bzip2. You compress text, you might get a file 10% of the size if you’re lucky. But binary data? You’ll be lucky if you shrink it by a factor of three. Now, with my NanoData compression service, you as the customer get to choose how much or how little you want. And you select the output format. You can hand me a terabyte of data and say, ‘Dr. Kipling, sir, I want you to distill this eight-dimensional MATLAB array down to a single Excel spreadsheet, no more than 10 rows by 10 columns.’ And that’s exactly what you’ll get.”

“And this miraculously distilled dataset that you give me… will it, by chance, have any passing resemblance to the original dataset I gave you?”

“Oh, sure, if you want it to,” said Kip. “But the fidelity service costs double.”

I resisted the overpowering urge to facepalm.

“Well, it’s certainly not the worst idea you’ve ever had,” I said diplomatically. “But I have to say, I’m amazed you keep launching new startups. A lesser man would have given up ten or twelve bankruptcies ago.”

“I guess I just have an uncanny sense for ideas ten years ahead of their time,” Kip smiled.

“Ten years ahead of anyone’s time,” Ella muttered.

“Right,” I said. “You’re a visionary. You have… the visions. Hey, what happened to that deli you were going to open? The one that was going to sell premium hay sandwiches? I thought that one was going to make it for sure.”

“Terrible shame. Turns out it’s very difficult to get sandwich-grade hay in Colorado. So, you know, it didn’t pan out. Very sad; I even had a name picked out: Hay Day Sandwiches. Get it?”

I didn’t really get it, but still nodded in mock sympathy.

“Anyway, since you brought up my new startup,” Kip said, oblivious to the death rays radiating towards him from Ella’s head, “let me take this opportunity to give the both of you the opportunity of your lifetime. I like you guys, so I’m going to cut you in as my very first angel investors. All I’m asking…”

And here he paused, looking at us. I knew what he was doing; he was trying to gauge our level of displeasure with him so he could pick a number that was sufficiently high, but not completely ridiculous.

“…is fifteen thousand,” he finished. “You get 5% of equity, and I’ll even throw in some nice swag. I’m having mugs and frisbees printed up as we speak.”

Around this time, Ella put her head down on her arms; she may or may not have been softly sobbing, I couldn’t really tell.

“That’s quite an offer, Kip,” I said. “And I’m really glad you like me enough to make it. It’s not like I’ve ever bought into your ideas before, but then, the thing I like best about you is how you never take repeated failure for an answer. Unfortunately, I just don’t have fifteen thousand right now. I just spent my last fifteen thousand souping up an old John Deere lawnmower so I can drive around the bike path blaring Ridin’ Dirty from three hundred watt speakers while glowing pink neon lights presage my arrival by five hundred feet. You should see it, it’s beautiful. But I swear, if I hadn’t done that, I’d be ready to sign on the dotted line right now.”

“That’s quite alright,” Kip said. “No harm, no foul. Your loss, my gain. It’s probably crazy of me to give up that much equity for so little anyway; this idea is going to make millions. No. Billions.”

He paused just long enough for some of the delusion to drip off; then I watched in real time as yet another unwise idea corkscrewed through his ear and crawled into his brain.

“Hey,” he said. “I’ve never thought of pimping out a John Deere lawnmower, but that’s a pretty good idea too. You sound like you have some experience with this now; want to go fifty-fifty on a startup? I’ll provide the salesmanship and take advantage of my many business contacts. You provide the technical knowledge. Ella, you can get in on this too; we’ll throw in a free turducken with every purchase.”

This time I definitely heard my wife sobbing, and just like that, it was time for Cornelius Kipling to leave.

a human and a monkey walk into an fMRI scanner…

February 8th, 2012 by Tal Yarkoni

Tor Wager and I have a “news and views” piece in Nature Methods this week; we discuss a paper by Mantini and colleagues (in the same issue) introducing a new method for identifying functional brain homologies across different species–essentially, identifying brain regions in humans and monkeys that seem to do roughly the same thing even if they’re not located in the same place anatomically. Mantini et al make some fairly strong claims about what their approach tells us about the evolution of the human brain (namely, that some cortical regions have undergone expansion relative to monkeys, while others have adapted substantively new functions). For reasons we articulate in our commentary, I’m personally not so convinced by the substantive conclusions, but I do think the core idea underlying the method is a very clever and potentially useful one:

Their technique, interspecies activity correlation (ISAC), uses functional magnetic resonance imaging (fMRI) to identify brain regions in which humans and monkeys exposed to the same dynamic stimulus—a 30-minute clip from the movie The Good, the Bad and the Ugly—show correlated patterns of activity (Fig. 1). The premise is that homologous regions should have similar patterns of activity across species. For example, a brain region sensitive to a particular configuration of features, including visual motion, hands, faces, object and others, should show a similar time course of activity in both species—even if its anatomical location differs across species and even if the precise features that drive the area’s neurons have not yet been specified.

Mo Costandi has more on the paper in an excellent Guardian piece (and I’m not just saying that because he quoted me a few times). All in all, I think it’s a very exciting method, and it’ll be interesting to see how it’s applied in future studies. I think there’s a fairly broad class of potential applications based loosely around the same idea of searching for correlated patterns. It’s an idea that’s already been used by Uri Hasson (an author on the Mantini et al paper) and others fairly widely in the fMRI literature to identify functional correspondences across different subjects; but you can easily imagine conceptually similar applications in other fields too–e.g., correlating gene expression profiles across species in order to identify structural homologies (actually, one could probably try this out pretty easily using the mouse and human data available in the Allen Brain Atlas).
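For the curious, here’s a toy sketch of the “search for correlated time courses” idea. To be clear, this is not Mantini et al’s actual pipeline: the region names, the two made-up stimulus signals, and the noise level are all assumptions I invented for illustration. It simply pairs each “human” region with the “monkey” region whose simulated time course correlates most strongly with it.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_matches(human, monkey):
    """For each human region, find the monkey region with the highest r."""
    matches = {}
    for h_name, h_ts in human.items():
        scored = [(pearson(h_ts, m_ts), m_name) for m_name, m_ts in monkey.items()]
        r, m_name = max(scored)
        matches[h_name] = (m_name, r)
    return matches


rng = random.Random(0)
n_timepoints = 200

# Two hypothetical stimulus-driven signals (e.g., "motion" and "faces" in a movie).
motion = [rng.gauss(0, 1) for _ in range(n_timepoints)]
faces = [rng.gauss(0, 1) for _ in range(n_timepoints)]


def noisy(signal, sd=0.5):
    """Simulate a region's measured time course: shared signal plus noise."""
    return [s + rng.gauss(0, sd) for s in signal]


# Hypothetical regions; the names carry no anatomical information on purpose.
human = {"human_A": noisy(motion), "human_B": noisy(faces)}
monkey = {"monkey_X": noisy(faces), "monkey_Y": noisy(motion)}

matches = best_matches(human, monkey)
for h_name, (m_name, r) in matches.items():
    print(f"{h_name} -> {m_name} (r = {r:.2f})")
```

Note that the matching uses only the time courses, never the labels or locations, which is the whole appeal of the approach.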

Mantini D, Hasson U, Betti V, Perrucci MG, Romani GL, Corbetta M, Orban GA, & Vanduffel W (2012). Interspecies activity correlations reveal functional correspondence between monkey and human brain areas. Nature Methods. PMID: 22306809

Wager, T., & Yarkoni, T. (2012). Establishing homology between monkey and human brains. Nature Methods. DOI: 10.1038/nmeth.1869

no free lunch in statistics

February 7th, 2012 by Tal Yarkoni

Simon and Tibshirani recently posted a short comment on the Reshef et al MIC data mining paper I blogged about a while back:

The proposal of Reshef et. al. (“MIC”) is an interesting new approach for discovering non-linear dependencies among pairs of measurements in exploratory data mining. However, it has a potentially serious drawback. The authors laud the fact that MIC has no preference for some alternatives over others, but as the authors know, there is no free lunch in Statistics: tests which strive to have high power against all alternatives can have low power in many important situations.

They then report some simulation results clearly demonstrating that MIC is (very) underpowered relative to Pearson correlation in most situations, and performs even worse relative to Székely & Rizzo’s distance correlation (which I hadn’t heard about, but will have to look into now). I mentioned low power as a potential concern in my own post, but figured it would be an issue under relatively specific circumstances (i.e., only for certain kinds of associations in relatively small samples). Simon & Tibshirani’s simulations pretty clearly demonstrate that isn’t so. Which, needless to say, rather dampens the enthusiasm for the MIC statistic.

the neuroinformatics of Neopets

January 26th, 2012 by Tal Yarkoni

In the process of writing a short piece for the APS Observer, I was fiddling around with Google Correlate earlier this evening. It’s a very neat toy, but if you think neuroimaging or genetics have a big multiple comparisons problem, playing with Google Correlate for a few minutes will put things in perspective. Here’s a line graph displaying the search term most strongly correlated (over time) with searches for “neuroinformatics”:

That’s right, the search term that covaries most strongly with “neuroinformatics” is none other than “Illinois film office” (which, to be fair, has a pretty appealing website). Other top matches include “wma support”, “sim codes”, “bed-in-a-bag”, “neopets secret”, “neopets guild”, and “neopets secret avatars”.

I may not have learned much about neuroinformatics from this exercise, but I did get a pretty good sense of how neuroinformaticians like to spend their free time…

p.s. I was pretty surprised to find that normalized search volume for just about every informatics-related term has fallen sharply in the last 10 years. I went in expecting the opposite! Maybe all the informaticians were early search adopters, and the rest of the world caught up? No, probably not. Anyway, enough of this; Neopia is calling me!

p.p.s. Seriously though, this is why data fishing expeditions are dangerous. Any one of these correlations is significant at p-less-than-point-whatever-you-like. And if your publication record depended on it, you could probably tell yourself a convincing story about why neuroinformaticians need to look up Garmin eMaps…
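If you want to see the fishing problem in miniature, here’s a small simulation. It has nothing to do with how Google Correlate actually works; the series length (20 time points) and the number of candidate terms (2,000) are arbitrary assumptions. Every series is pure random noise, and yet the best match still looks impressive.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)
n_timepoints = 20    # assumed length of each search-volume series
n_candidates = 2000  # assumed number of unrelated search terms to fish through

# The "target" term and every candidate are independent noise by construction,
# so any correlation we find is spurious.
target = [rng.gauss(0, 1) for _ in range(n_timepoints)]

best_r = 0.0
for _ in range(n_candidates):
    candidate = [rng.gauss(0, 1) for _ in range(n_timepoints)]
    r = pearson(target, candidate)
    if abs(r) > abs(best_r):
        best_r = r

print(f"Strongest correlation among {n_candidates} noise series: r = {best_r:.2f}")
```

Scale the candidate pool up from 2,000 to the many millions of queries a real search engine sees, and a near-perfect match to almost any target series stops being surprising.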

Attention publishers: the data in your tables want to be free! Free!

January 7th, 2012 by Tal Yarkoni

The Neurosynth database is getting an upgrade over the next couple of weeks; it’s going to go from 4,393 neuroimaging studies to around 5,800. Unfortunately, updating the database is kind of a pain, because academic publishers like to change the format of their full-text HTML articles, which has a nasty habit of breaking the publisher-specific HTML parsers I’ve written. When you expect ScienceDirect to give you <table cellspacing=10>, but you get <table> with no cellspacing attribute (the horror!), bad things happen in XPath land. And then those bad things need to be repaired. And I hate repairing stuff! So I don’t do it very often. Like, once every 6 to 9 months.
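Here’s a minimal sketch of how that kind of breakage happens. It isn’t the actual Neurosynth parser code: the markup snippets are invented, and I’m using the standard library’s ElementTree (which handles only well-formed markup and a small XPath subset) purely for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-ins for a publisher's markup before and after
# a template change. Real full-text HTML is much messier than this.
old_markup = "<html><body><table cellspacing='10'><tr><td>1</td></tr></table></body></html>"
new_markup = "<html><body><table><tr><td>1</td></tr></table></body></html>"

BRITTLE = ".//table[@cellspacing='10']"  # depends on a purely cosmetic attribute
ROBUST = ".//table"                      # depends only on the element itself


def count_tables(markup, xpath):
    """Count how many elements in the markup match the given XPath."""
    return len(ET.fromstring(markup).findall(xpath))


for label, markup in [("old", old_markup), ("new", new_markup)]:
    print(label, "brittle:", count_tables(markup, BRITTLE),
          "robust:", count_tables(markup, ROBUST))
```

The brittle selector silently matches nothing once the attribute disappears, which is why these failures tend to go unnoticed until the database is updated.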

In an ideal world, there would be no need to write (and fix) custom filters for different publishers, because the publishers would all simultaneously make XML representations of their articles available (in addition to HTML, PDF, etc.), and then people who have legitimate data mining reasons for regularly downloading hundreds of articles at a time wouldn’t have to cry themselves to sleep every night. But as it stands, only one major publisher of neuroimaging articles (PLoS) provides XML versions of all articles. A minority of articles from other publishers are available in XML from BioMed Central, but that’s still just a fraction of the existing literature.

Anyway, the HTML thing is annoying, but it’s possible to work around it. What’s much more problematic is that some publishers lock up the data in the tables of their articles. To make Neurosynth work, I have to be able to identify rows in tables that look like brain activations. That is, things that look roughly like this:

Most publishers are nice enough to format article tables as HTML tables; which is to say, I can look for tags like <table> and then work down the XPath tree to identify all the rows, and then scan each row for values that look activation-like. Then those values go into the database, and poof, next thing you know, you have meta-analytic brain activation maps from hundreds of studies. But some publishers–most notably, Frontiers–throw a wrench in the works by failing to format tables in HTML; instead, they present the tables as images (see for instance this JPEG table, pulled from this article). Which means I can’t really extract any data from them, and as a result, you’re not going to see activations from articles published in Frontiers journals in Neurosynth any time soon. So if you publish fMRI articles in Frontiers in Human Neuroscience regularly, and are wondering why I’ve been ignoring you (I like you! I promise!), now you know.
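To make the “scan each row for activation-like values” step concrete, here’s a rough sketch. It is a simplified stand-in rather than the real Neurosynth code: the sample table (including its values) is made up, and the heuristic (three consecutive numeric cells within ±120) is an assumption chosen only to illustrate the idea.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def as_coordinate(text):
    """Return the cell as a float if it could plausibly be a coordinate."""
    try:
        value = float(text.replace("\u2212", "-"))  # tables often use a Unicode minus
    except ValueError:
        return None
    return value if abs(value) <= 120 else None  # assumed plausible range


def find_xyz(row):
    """Return the first run of three consecutive coordinate-like cells, if any."""
    values = [as_coordinate(cell) for cell in row]
    for i in range(len(values) - 2):
        triple = values[i:i + 3]
        if all(v is not None for v in triple):
            return tuple(triple)
    return None


# Made-up example table; the regions and numbers are illustrative only.
sample_html = """
<table>
  <tr><th>Region</th><th>x</th><th>y</th><th>z</th><th>Z score</th></tr>
  <tr><td>Left amygdala</td><td>\u221222</td><td>-4</td><td>-18</td><td>4.51</td></tr>
  <tr><td>Right insula</td><td>38</td><td>16</td><td>2</td><td>3.97</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
activations = [xyz for xyz in map(find_xyz, parser.rows) if xyz is not None]
print(activations)
```

None of this is possible when the table is a JPEG: there are no rows or cells to walk, only pixels.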

Anyway, on the remote chance that anyone reading this has any sway with people high up at Frontiers, could you please ask them to release their data? Pretty please? Lack of access to data in tables seems to be a pretty common complaint in the data mining community; I’ve talked to other people in the neuroinformatics world who’ve also expressed frustration about it, and I imagine the same is true of people in other disciplines. It’s particularly surprising given that Frontiers is, in theory, an open access publisher. I can see the data in your tables, Frontiers; why won’t you also let me read it?

Okay, I know this kind of stuff doesn’t really interest anyone; I’m just venting. The main point is, Neurosynth is going to be bigger and (very slightly) better in the near future.