“Open Source, Open Science” Meeting Report – March 2015

[The report below was collectively authored by participants at the Open Source, Open Science meeting, and has been cross-posted in other places.]

On March 19th and 20th, the Center for Open Science hosted a small meeting in Charlottesville, VA, convened by COS and co-organized by Kaitlin Thaney (Mozilla Science Lab) and Titus Brown (UC Davis). People working across the open science ecosystem attended, including publishers, infrastructure non-profits, public policy experts, community builders, and academics.
Open Science has emerged into the mainstream, primarily due to concerted efforts from various individuals, institutions, and initiatives. This small, focused gathering brought together several of those community leaders. The purpose of the meeting was to define common goals, discuss common challenges, and coordinate on common efforts.

We had good discussions about several issues at the intersection of technology and social hacking including badging, improving standards for scientific APIs, and developing shared infrastructure. We also talked about coordination challenges due to the rapid growth of the open science community. At least three collaborative projects emerged from the meeting as concrete outcomes to combat the coordination challenges.

A repeated theme was how to make the value proposition of open science more explicit. Why should scientists become more open, and why should institutions and funders support open science? We agreed that incentives in science are misaligned with practices, and we identified particular pain points and opportunities to nudge incentives. We focused on providing information about the benefits of open science to researchers, funders, and administrators, and emphasized reasons aligned with each stakeholder’s interests. We also discussed industry interest in “open”, both in making good use of open data, and also in participating in the open ecosystem. One of the collaborative projects emerging from the meeting is a paper or papers to answer the question “Why go open?” for researchers.

Many groups are providing training for tools, statistics, or workflows that could improve openness and reproducibility. We discussed methods of coordinating training activities, such as a training “decision tree” defining potential entry points and next steps for researchers. For example, Center for Open Science offers statistics consulting, rOpenSci offers training on tools, and Software Carpentry, Data Carpentry, and Mozilla Science Lab offer training on workflows. A federation of training services could be mutually reinforcing and bolster collective effectiveness, and facilitate sustainable funding models.

The challenge of supporting training efforts was linked to the larger challenge of funding the so-called “glue” – the technical infrastructure that is only noticed when it fails to function. One collaboration working on such infrastructure is the SHARE project, a partnership between the Association of Research Libraries, its academic association partners, and the Center for Open Science. There is little glory in training and infrastructure, but both are essential elements for providing knowledge to enable change, and tools to enact change.

Another repeated theme was the “open science bubble”. Many participants felt that they were failing to reach people outside of the open science community. Training in data science and software development was recognized as one way to introduce people to open science. For example, data integration and techniques for reproducible computational analysis naturally connect to discussions of data availability and open source. Re-branding was also discussed as a solution – rather than “post preprints!”, say “get more citations!” Another important realization was that researchers who engage with open practices need not, and indeed may not want to, self-identify as “open scientists” per se. The identity and behavior need not be the same.

A number of concrete actions and collaborative activities emerged at the end, including a more coordinated effort around badging, collaboration on API connections between services and producing an article on best practices for scientific APIs, and the writing of an opinion paper outlining the value proposition of open science for researchers. While several proposals were advanced for “next meetings” such as hackathons, no decision has yet been reached. But, a more important decision was clear – the open science community is emerging, strong, and ready to work in concert to help the daily scientific practice live up to core scientific values.

[Authors are listed in reverse alphabetical order; order does not denote relative contribution.]

  1. Tal Yarkoni, University of Texas at Austin
  2. Kara Woo, NCEAS
  3. Andrew Updegrove, Gesmer Updegrove and ConsortiumInfo.org
  4. Kaitlin Thaney, Mozilla Science Lab
  5. Jeffrey Spies, Center for Open Science
  6. Courtney Soderberg, Center for Open Science
  7. Elliott Shore, Association of Research Libraries
  8. Andrew Sallans, Center for Open Science
  9. Karthik Ram, rOpenSci and Berkeley Institute for Data Science
  10. Min Ragan-Kelley, IPython and UC Berkeley
  11. Brian Nosek, Center for Open Science and University of Virginia
  12. Erin C. McKiernan, Wilfrid Laurier University
  13. Jennifer Lin, PLOS
  14. Amye Kenall, BioMed Central
  15. Mark Hahnel, figshare
  16. C. Titus Brown, UC Davis
  17. Sara D. Bowman, Center for Open Science

Now I am become DOI, destroyer of gatekeeping worlds

Digital object identifiers (DOIs) are much sought-after commodities in the world of academic publishing. If you’ve never seen one, a DOI is a unique string associated with a particular digital object (most commonly a publication of some kind) that lets the internet know where to find the stuff you’ve written. For example, say you want to know where you can get a hold of an article titled, oh, say, Designing next-generation platforms for evaluating scientific output: what scientists can learn from the social web. In the real world, you’d probably go to Google, type that title in, and within three or four clicks, you’d arrive at the document you’re looking for. As it turns out, the world of formal resource location is fairly similar to the real world, except that instead of using Google, you go to a website called dx.DOI.org, and then you plug in the string ‘10.3389/fncom.2012.00072’, which is the DOI associated with the aforementioned article. And then, poof, you’re automagically linked directly to the original document, upon which you can gaze in great awe for as long as you feel comfortable.
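If you'd rather skip the website and let a script do the plugging-in, the lookup amounts to gluing the DOI onto a resolver address. Here is a minimal sketch in Python. The function name `doi_to_url` is made up for illustration, and the resolver host `https://doi.org/` is an assumption on my part (it is the address I would expect to use today in place of the dx.DOI.org site mentioned above). The sanity check is deliberately crude and is not a full DOI syntax validator.

```python
from urllib.parse import quote

# Assumption: resolver host used in place of the dx.DOI.org site named in the text.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Build a resolver URL for a DOI string (hypothetical helper)."""
    doi = doi.strip()
    # Crude check only: DOIs start with "10." and contain a "/" separating
    # the registrant prefix from the suffix.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"this does not look like a DOI: {doi!r}")
    # Percent-encode anything unusual in the suffix, but keep "/" readable.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("10.3389/fncom.2012.00072"))
# prints https://doi.org/10.3389/fncom.2012.00072
```

Opening the printed address in a browser should redirect you to wherever the publisher currently keeps the document, which is the whole point: the DOI stays put even if the document moves.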

Historically, DOIs have almost exclusively been issued by official-type publishers: Elsevier, Wiley, PLoS and such. Consequently, DOIs have had a reputation as a minor badge of distinction–probably because you’d traditionally only get one if your work was perceived to be important enough for publication in a journal that was (at least nominally) peer-reviewed. And perhaps because of this tendency to view the presence of a DOI as something like an implicit seal of approval from the Great Sky Guild of Academic Publishing, many journals impose official or unofficial commandments to the effect that, when writing a paper, one shalt only citeth that which hath been DOI-ified. For example, here’s a boilerplate Elsevier statement regarding references (in this case, taken from the Neuron author guidelines):

References should include only articles that are published or in press. For references to in press articles, please confirm with the cited journal that the article is in fact accepted and in press and include a DOI number and online publication date. Unpublished data, submitted manuscripts, abstracts, and personal communications should be cited within the text only.

This seems reasonable enough until you realize that citations that occur “within the text only” aren’t very useful, because they’re ignored by virtually all formal citation indices. You want to cite a blog post in your Neuron paper and make sure it counts? Well, you can’t! Blog posts don’t have DOIs! You want to cite a what? A tweet? That’s just crazy talk! Tweets are 140 characters! You can’t possibly cite a tweet; the citation would be longer than the tweet itself!

The injunction against citing DOI-less documents is unfortunate, because people deserve to get credit for the interesting things they say–and it turns out that they have, on rare occasion, been known to say interesting things in formats other than the traditional peer-reviewed journal article. I’m pretty sure if Mark Twain were alive today, he’d write the best tweets EVER. Well, maybe it would be a tie between Mark Twain and the NIH Bear. But Mark Twain would definitely be up there. And he’d probably write some insightful blog posts too. And then, one imagines that other people would probably want to cite this brilliant 21st-century man of letters named @MarkTwain in their work. Only they wouldn’t be allowed to, you see, because 21st-century Mark Twain doesn’t publish all, or even most, of his work in traditional pre-publication peer-reviewed journals. He’s too impatient to rinse-and-repeat his way through the revise-and-resubmit process every time he wants to share a new idea with the world, even when those ideas are valuable. 21st-century @MarkTwain just wants his stuff out there already where people can see it.

Why does Elsevier hate 21st-century Mark Twain, you ask? I don’t know. But in general, I think there are two main reasons for the disdain many people seem to feel at the thought of allowing authors to freely cite DOI-less objects in academic papers. The first reason has to do with permanence—or lack thereof. The concern here is that if we allowed everyone to cite just any old web page, blog post, or tweet in academic articles, there would be no guarantee that those objects would still be around by the time the citing work was published, let alone several years hence. Which means that readers might be faced with a bunch of dead links. And dead links are not very good at backing up scientific arguments. In principle, the DOI requirement is supposed to act like some kind of safety word that protects a citation from the ravages of time—presumably because having a DOI means the cited work is important enough for the watchful eye of Sauron Elsevier to periodically scan across it and verify that it hasn’t yet fallen off of the internet’s cliffside.

The second reason has to do with quality. Here, the worry is that we can’t just have authors citing any old opinion someone else published somewhere on the web, because, well, think of the children! Terrible things would surely happen if we allowed authors to link to unverified and unreviewed works. What would stop me from, say, writing a paper criticizing the idea that human activity is contributing to climate change, and supporting my argument with “citations” to random pages I’ve found via creative Google searches? For that matter, what safeguard would prevent a brazen act of sockpuppetry in which I cite a bunch of pages that I myself have (anonymously) written? Loosening the injunction against formally citing non-peer-reviewed work seems tantamount to inviting every troll on the internet to a formal academic dinner.

To be fair, I think there’s some merit to both of these concerns. Or at least, I think there used to be some merit to these concerns. Back when the internet was a wee nascent flaky thing winking in and out of existence every time a dial-up modem connection went down, it made sense to worry about permanence (I mean, just think: if we had allowed people to cite GeoCities webpages in published articles, every last one of those citation links would now be dead!) And similarly, back in the days when peer review was an elite sort of activity that could only be practiced by dignified gentlepersons at the cordial behest of a right honorable journal editor, it probably made good sense to worry about quality control. But the merits of such concerns have now largely disappeared, because we now live in a world of marvelous technology, where bits of information cost virtually nothing to preserve forever, and a new post-publication platform that allows anyone to review just about any academic work in existence seems to pop up every other week (cf. PubPeer, PubMed Commons, Publons, etc.). In the modern world, nothing ever goes out of print, and if you want to know what a whole bunch of experts think about something, you just have to ask them about it on Twitter.

Which brings me to this blog post. Or paper. Whatever you want to call it. It was first published on my blog. You can find it–or at least, you could find it at one point in time–at the following URL: http://www.talyarkoni.org/blog/2015/03/04/now-i-am-become-doi-destroyer-of-gates.

Unfortunately, there’s a small problem with this URL: it contains nary a DOI in sight. Really. None of the eleventy billion possible substrings in it look anything like a DOI. You can even scramble the characters if you like; I don’t care. You’re still not going to find one. Which means that most journals won’t allow you to officially cite this blog post in your academic writing. Or any other post, for that matter. You can’t cite my post about statistical power and magical sample sizes; you can’t cite Joe Simmons’ Data Colada post about Mturk and effect sizes; you can’t cite Sanjay Srivastava’s discussion of replication and falsifiability; and so on ad infinitum. Which is a shame, because it’s a reasonably safe bet that there are at least one or two citation-worthy nuggets of information trapped in some of those blog posts (or millions of others), and there’s no reason to believe that these nuggets must all have readily-discoverable analogs somewhere in the “formal” scientific literature. As the Elsevier author guidelines would have it, the appropriate course of action in such cases is to acknowledge the source of an idea or finding in the text of the article, but not to grant any other kind of formal credit.

Now, typically, this is where the story would end. The URL can’t be formally cited in an Elsevier article; end of story. BUT! In this case, the story doesn’t quite end there. A strange thing happens! A short time after it appears on my blog, this post also appears–in virtually identical form–on something called The Winnower, which isn’t a blog at all, but rather, a respectable-looking alternative platform for scientific publication and evaluation.

Even more strangely, on The Winnower, a mysterious-looking set of characters appears alongside the text. For technical reasons, I can’t tell you what the set of characters actually is (because it isn’t assigned until this piece is published!). But I can tell you that it starts with “10.15200/winn”. And I can also tell you what it is: It’s a DOI! It’s one bona fide free DOI, courtesy of The Winnower. I didn’t have to pay for it, or barter any of my services for it, or sign away any little pieces of my soul to get it*. I just installed a WordPress plugin, pressed a few buttons, and… poof, instant DOI. So now this is, proudly, one of the world’s first N (where N is some smallish number probably below 1000) blog posts to dress itself up in a nice DOI (Figure 1). Presumably because it’s getting ready for a wild night out on the academic town.

sticks and stones may break my bones, but DOIs make me feel pretty

Figure 1. Effects of assigning DOIs to blog posts: an anthropomorphic depiction. (A) A DOI-less blog post feels exposed and inadequate; it envies its more reputable counterparts and languishes in a state of torpor and existential disarray. (B) Freshly clothed in a newly-minted DOI, the same blog post feels confident, charismatic, and alert. Brimming with energy, it eagerly awaits the opportunity to move mountains and reshape scientific discourse. Also, it has longer arms.

Does the mere fact that my blog post now has a DOI actually change anything, as far as the citation rules go? I don’t know. I have no idea if publishers like Elsevier will let you officially cite this piece in an article in one of their journals. I would guess not, but I strongly encourage you to try it anyway (in fact, I’m willing to let you try to cite this piece in every paper you write for the next year or so—that’s the kind of big-hearted sacrifice I’m willing to make in the name of science). But I do think it solves both the permanence and quality control issues that are, in theory, the whole reason for journals having a no-DOI-no-shoes-no-service policy in the first place.

How? Well, it solves the permanence problem because The Winnower is a participant in the CLOCKSS archive, which means that if The Winnower ever goes out of business (a prospect that, let’s face it, became a little bit more likely the moment this piece appeared on their site), this piece will be immediately, freely, and automatically made available to the worldwide community in perpetuity via the associated DOI. So you don’t need to trust the safety of my blog—or even The Winnower—any more. This piece is here to stay forever! Rejoice in the cheapness of digital information and librarians’ obsession with archiving everything!

As for the quality argument, well, clearly, this here is not what you would call a high-quality academic work. But I still think you should be allowed to cite it wherever and whenever you want. Why? For several reasons. First, it’s not exactly difficult to determine whether or not it’s a high-quality academic work—even if you’re not willing to exercise your own judgment. When you link to a publication on The Winnower, you aren’t just linking to a paper; you’re also linking to a review platform. And the reviews are very prominently associated with the paper. If you dislike this piece, you can use the comment form to indicate exactly why you dislike it (if you like it, you don’t need to write a comment; instead, send an envelope stuffed with money to my home address).

Second, it’s not at all clear that banning citations to non-prepublication-reviewed materials accomplishes anything useful in the way of quality control. The reliability of the peer-review process is sufficiently low that there is simply no way for it to consistently sort the good from the bad. The problem is compounded by the fact that rejected manuscripts are rarely discarded forever; typically, they’re quickly resubmitted to another journal. The bibliometric literature shows that it’s possible to publish almost anything in the peer-reviewed literature given enough persistence.

Third, I suspect—though I have no data to support this claim—that a worldview that treats having passed peer review and/or receiving a DOI as markers of scientific quality is actually counterproductive to scientific progress, because it promotes a lackadaisical attitude on the part of researchers. A reader who believes that a claim is significantly more likely to be true in virtue of having a DOI is a reader who is slightly less likely to take the extra time to directly evaluate the evidence for that claim. The reality, unfortunately, is that most scientific claims are wrong, because the world is complicated and science is hard. Pretending that there is some reasonably accurate mechanism that can sort all possible sources into reliable and unreliable buckets—even to a first order of approximation—is misleading at best and dangerous at worst. Of course, I’m not suggesting that you can’t trust a paper’s conclusions unless you’ve read every work it cites in detail (I don’t believe I’ve ever done that for any paper!). I’m just saying that you can’t abdicate the responsibility of evaluating the evidence to some shapeless, anonymous mass of “reviewers”. If I decide not to chase down the Smith & Smith (2007) paper that Jones & Jones (2008) cite as critical support for their argument, I shouldn’t be able to turn around later and say something like “hey, Smith & Smith (2007) was peer reviewed, so it’s not my fault for not bothering to read it!”

So where does that leave us? Well, if you’ve read this far, and agree with most or all of the above arguments, I hope I can convince you of one more tiny claim. Namely, that this piece represents (a big part of) the future of academic publishing. Not this particular piece, of course; I mean the general practice of (a) assigning unique identifiers to digital objects, (b) preserving those objects for all posterity in a centralized archive, and (c) allowing researchers to cite any and all such objects in their work however they like. (We could perhaps also add (d) working very hard to promote centralized “post-publication” peer review of all of those objects–but that’s a story for another day.)

These are not new ideas, mind you. People have been calling for a long time for a move away from a traditional gatekeeping-oriented model of pre-publication review and towards more open publication and evaluation models. These calls have intensified in recent years; for instance, in 2012, a special topic in Frontiers in Computational Neuroscience featured 18 different papers that all independently advocated for very similar post-publication review models. Even the actual attachment of DOIs to blog posts isn’t new; as a case in point, consider that C. Titus Brown—in typical pioneering form—was already experimenting with ways to automatically DOIfy his blog posts via FigShare way back in the same dark ages of 2012. What is new, though, is the emergence and widespread adoption of platforms like The Winnower, FigShare, or Research Gate that make it increasingly easy to assign a DOI to academically-relevant works other than traditional journal articles. Thanks to such services, you can now quickly and effortlessly attach a DOI to your open-source software packages, technical manuals and white papers, conference posters, or virtually any other kind of digital document.

Once such efforts really start to pick up steam—perhaps even in the next two or three years—I think there’s a good chance we’ll fall into a positive feedback loop, because it will become increasingly clear that for many kinds of scientific findings or observations, there’s simply nothing to be gained by going through the cumbersome, time-consuming conventional peer review process. To the contrary, there will be all kinds of incentives for researchers to publish their work as soon as they feel it’s ready to share. I mean, look, I can write blog posts a lot faster than I can write traditional academic papers. Which means that if I write, say, one DOI-adorned blog post a month, my Google Scholar profile is going to look a lot bulkier a year from now, at essentially no extra effort or cost (since I’m going to write those blog posts anyway!). In fact, since services like The Winnower and FigShare can assign DOIs to documents retroactively, you might not even have to wait that long. Check back this time next week, and I might have a dozen new indexed publications! And if some of these get cited—whether in “real” journals or on other indexed blog posts—they’ll then be contributing to my citation count and h-index too (at least on Google Scholar). What are you going to do to keep up?
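For anyone who has never worked out an h-index by hand: it is the largest number h such that h of your publications each have at least h citations. The sketch below shows what a handful of DOI-adorned blog posts might do to it. The citation counts are entirely made up for illustration, and `h_index` is just a textbook implementation, not whatever Google Scholar actually runs behind the scenes.

```python
def h_index(citations):
    """Largest h such that h items have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` items all have >= rank citations
        else:
            break
    return h


# Hypothetical citation counts for five traditional papers.
papers = [10, 8, 5, 4, 3]
# Hypothetical blog posts: three that nobody cites much, two that do well.
quiet_posts = [1, 0, 0]
popular_posts = [5, 5]

print(h_index(papers))                  # prints 4
print(h_index(papers + quiet_posts))    # prints 4
print(h_index(papers + popular_posts))  # prints 5
```

In other words, padding the profile with uncited posts bulks up the publication count but leaves the h-index alone; the posts only move it if people actually cite them.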

Now, this may all seem a bit off-putting if you’re used to thinking of scientific publication as a relatively formal, laborious process, where two or three experts have to sign off on what you’ve written before it gets to count for anything. If you’ve grown comfortable with the idea that there are “real” scientific contributions on the one hand, and a blooming, buzzing confusion of second-rate opinions on the other, you might find the move to suddenly make everything part of the formal record somewhat disorienting. It might even feel like some people (like, say, me) are actively trying to game the very system that separates science from tabloid news. But I think that’s the wrong perspective. I don’t think anybody—certainly not me—is looking to get rid of peer review. What many people are actively working towards are alternative models of peer review that will almost certainly work better.

The right perspective, I would argue, is to embrace the benefits of technology and seek out new evaluation models that emphasize open, collaborative review by the community as a whole instead of closed pro forma review by two or three semi-randomly selected experts. We now live in an era where new scientific results can be instantly shared at essentially no cost, and where sophisticated collaborative filtering algorithms and carefully constructed reputation systems can potentially support truly community-driven, quantitatively-grounded open peer review on a massive scale. In such an environment, there are few legitimate excuses for sticking with archaic publication and evaluation models—only the familiar, comforting pull of the status quo. Viewed in this light, using technology to get around the limitations of old gatekeeper-based models of scientific publication isn’t gaming the system; it’s actively changing the system—in ways that will ultimately benefit us all. And in that context, the humble self-assigned DOI may ultimately become—to liberally paraphrase Robert Oppenheimer and the Bhagavad Gita—one of the destroyers of the old gatekeeping world.

the weeble distribution: a love story

“I’m a statistician,” she wrote. “By day, I work for the census bureau. By night, I use my statistical skills to build the perfect profile. I’ve mastered the mysterious headline, the alluring photo, and the humorous description that comes off as playful but with a hint of an edge. I’m pretty much irresistible at this point.”

“Really?” I wrote back. “That sounds pretty amazing. The stuff about building the perfect profile, I mean. Not the stuff about working at the census bureau. Working at the census bureau sounds decent, I guess, but not amazing. How do you build the perfect profile? What kind of statistical analysis do you do? I have a bit of programming experience, but I don’t know any statistics. Maybe we can meet some time and you can teach me a bit of statistics.”

I am, as you can tell, a smooth operator.

A reply arrived in my inbox a day later:

No, of course I don’t really spend all my time constructing the perfect profile. What are you, some kind of idiot?

And so was born our brief relationship; it was love at first insult.

“This probably isn’t going to work out,” she told me within five minutes of meeting me in person for the first time. We were sitting in the lobby of the Chateau Laurier downtown. Her choice of venue. It’s an excellent place to meet an internet date; if you don’t like the way they look across the lobby, you just back out quietly and then email the other person to say sorry, something unexpected came up.

“That fast?” I asked. “You can already tell you don’t like me? I’ve barely introduced myself.”

“Oh, no, no. It’s not that. So far I like you okay. I’m just going by the numbers here. It probably isn’t going to work out. It rarely does.”

“That’s a reasonable statement,” I said, “but a terrible thing to say on a first date. How do you ever get a second date with anyone, making that kind of conversation?”

“It helps to be smoking hot,” she said. “Did I offend you terribly?”

“Not really, no. But I’m not a very sentimental kind of guy.”

“Well, that’s good.”

Later, in bed, I awoke to a shooting pain in my leg. It felt like I’d been kicked in the shin.

“Did you just kick me in the shin,” I asked.

“Yes.”

“Any particular reason?”

“You were a little bit on my side of the bed. I don’t like that.”

“Oh. Okay. Sorry.”

“I still don’t think this will work,” she said, then rolled over and went back to sleep.

She was right. We dated for several months, but it never really worked. We had terrific fights, and reasonable make-up sex, but our interactions never had very much substance. We related to one another like two people who were pretty sure something better was going to come along any day now, but in the meantime, why not keep what we had going, because it was better than eating dinner alone.

I never really learned what she liked; I did learn that she disliked most things. Mostly our conversations revolved around statistics and food. I’ll give you some examples.

“Beer is the reason for statistics,” she informed me one night while we were sitting at Cicero’s and sharing a lasagna.

“I imagine beer might be the reason for a lot of bad statistics,” I said.

“No, no. Not just bad statistics. All statistics. The discipline of statistics as we know it exists in large part because of beer.”

“Pray, do go on,” I said, knowing it would have been futile to ask her to shut up.

“Well,” she said, “there once was a man named Student…”

I won’t bore you with all the details; the gist of it is that there once was a man by name of William Gosset, who worked for Guinness as a brewer in the early 1900s. Like a lot of other people, Gosset was interested in figuring out how to make Guinness taste better, so he invented a bunch of statistical tests to help him quantify the differences in quality between different batches of beer. Guinness didn’t want Gosset to publish his statistical work under his real name, for fear he might somehow give away their trade secrets, so they made him use the pseudonym “Student”. As a result, modern-day statisticians often work with something called Student’s t distribution, which is apparently kind of a big deal. And all because of beer.

“That’s a nice story,” I said. “But clearly, if Student—or Gosset or whatever his real name was—hadn’t been working for Guinness, someone else would have invented the same tests shortly afterwards, right? It’s not like he was so brilliant no one else would have ever thought of the same thing. I mean, if Edison hadn’t invented the light bulb, someone else would have. I take it you’re not really saying that without beer, there would be no statistics.”

“No, that is what I’m saying. No beer, no stats. Simple.”

“Yeah, okay. I don’t believe you.”

“Oh no?”

“No. What’s that thing about lies, damned lies, and stat—”

“Statistics?”

“No. Statisticians.”

“No idea,” she said. “Never heard that saying.”

“It’s that they lie. The saying is that statisticians lie. Repeatedly and often. About anything at all. It’s that they have no moral compass.”

“Sounds about right.”

“I don’t get this whole accurate to within 3 percent 19 times out of 20 business,” I whispered into her ear late one night after we’d had sex all over her apartment. “I mean, either you’re accurate or you’re not, right? If you’re accurate, you’re accurate. And if you’re not accurate, I guess maybe then you could be within 3 percent or 7 percent or whatever. But what the hell does it mean to be accurate X times out of Y? And how would you even know how many times you’re accurate? And why is it always 19 out of 20?”

She turned on the lamp on the nightstand and rolled over to face me. Her hair covered half of her face; the other half was staring at me with those pale blue eyes that always looked like they wanted to either jump you or murder you, and you never knew which.

“You really want me to explain confidence intervals to you at 11:30 pm on a Thursday night?”


“How much time do you have?”

“All, Night, Long,” I said, channeling Lionel Richie.

“Wonderful. Let me put my spectacles on.”

She fumbled around on the nightstand looking for them.

“What do you need your glasses for,” I asked. “We’re just talking.”

“Well, I need to be able to see you clearly. I use the amount of confusion on your face to gauge how much I need to dumb down my explanations.”

Frankly, most of the time she was as cold as ice. The only time she really came alive—other than in the bedroom—was when she talked about statistics. Then she was a different person: excited and exciting, full of energy. She looked like a giant Tesla coil, mid-discharge.

“Why do you like statistics so much,” I asked her over a bento box at ZuNama one day.

“Because,” she said, “without statistics, you don’t really know anything.”

“I thought you said statistics was all about uncertainty.”

“Right. Without statistics, you don’t know anything… and with statistics, you still don’t know anything. But with statistics, we can at least get a sense of how much we know or don’t know.”

“Sounds very… Rumsfeldian,” I said. “Known knowns… unknown unknowns… is that right?”

“It’s kind of right,” she said. “But the error bars are pretty huge.”

“I’m going to pretend I know what that means. If I admit I have no idea, you’ll think I wasn’t listening to you in bed the other night.”

“No,” she said. “I know you were listening. You were listening very well. It’s just that you were understanding very poorly.”

Uncertainty was a big theme for her. Once, to make a point, she asked me how many nostrils a person breathes through at any given time. And then, after I experimented on myself and discovered that the answer was one and not two, she pushed me on it:

“Well, how do you know you’re not the only freak in the world who breathes through one nostril?”

“Easily demonstrated,” I said, and stuck my hand right in front of her face, practically covering her nose.

“Breathe out!”

She did.

“And now breathe in! And then repeat several times!”

She did.

“You see,” I said, retracting my hand once I was satisfied. “It’s not just me. You also breathe through one nostril at a time. Right now it’s your left.”

“That proves nothing,” she said. “We’re not independent observations; I live with you. You probably just gave me your terrible mononarial disease. All you’ve shown is that we’re both sick.”

I realized then that I wasn’t going to win this round—or any other round.

“Try the unagi,” I said, waving at the sushi in a heroic effort to change the topic.

“You know I don’t like to try new things. It’s bad enough I’m eating sushi.”

“Try the unagi,” I suggested again.

So she did.

“It’s not bad,” she said after chewing on it very carefully for a very long time. “But it could use some ketchup.”

“Don’t you dare ask them for ketchup,” I said. “I will get up and leave if you ask them for ketchup.”

She waved her hand at the server.

“There once was a gentleman named Bayes,” she said over coffee at Starbucks one morning. I was running late for work, but so what? Who’s going to pass up the chance to hear about a gentleman named Bayes when the alternative is spending the morning refactoring enterprise code and filing progress reports?

“Oh yes, I’ve heard about him,” I said. “He’s the guy who came up with Bayes’ theorem.” I’d heard of Bayes’ theorem in some distant class somewhere, and knew it had something to do with statistics, though I had not one clue what it actually referred to.

“No, the Bayes I’m talking about is John Bayes—my mechanic. He’s working on my car right now.”


“No, not really, you idiot. Yes, Bayes as in Bayes’ theorem.”

“Thought so. Well, go ahead and tell me all about him. What is John Bayes famous for?”

“Bayes’ theorem.”

“Huh. How about that.”

She launched into a very dry explanation of conditional probabilities and prior distributions and a bunch of other terms I’d never heard of before and haven’t remembered since. I stopped her about three minutes in.

“You know none of this helps me, right? I mean, really, I’m going to forget anything you tell me. You know what might help, is maybe if instead of giving me these long, dry explanations, you could put things in a way I can remember. Like, if you, I don’t know, made up a limerick. I bet I could remember your explanations that way.”

“Oh, a limerick. You want a Bayesian limerick. Okay.”

She scrunched up her forehead like she was thinking very deeply. Held the pose for a few seconds.

“There once was a man named John Bayes,” she began, and then stopped.

“Yes,” I said. “Go on.”

“Who spent most of his days… calculating the posterior probability of go fuck yourself.”

“Very memorable,” I said, waving for the check.

“Suppose I wanted to estimate how much I love you,” I said over asparagus and leek salad at home one night. “How would I do that?”

“You love me?” she arched an eyebrow.

“Good lord no,” I laughed hysterically. “It’s a completely and utterly hypothetical question. But answer it anyway. How would I do it?”

She shrugged.

“That’s a measurement problem. I’m a statistician, not a psychometrician. I develop and test statistical models. I don’t build psychological instruments. I haven’t the faintest idea how you’d measure love. As I’m sure you’ve observed, it’s something I don’t know or care very much about.”

I nodded. I had observed that.

“You act like there’s a difference between all these things there’s really no difference between,” I said. “Models, measures… what the hell do I care? I asked a simple question, and I want a simple answer.”

“Well, my friend, in that case, the answer is that you must look deep into your own heart and say, heart, how much do I love this woman, and then your heart will surely whisper the answer delicately into your oversized ear.”

“That’s the dumbest thing I’ve ever heard,” I said, tugging self-consciously at my left earlobe. It wasn’t that big.

“Right?” she said. “You said you wanted a simple answer. I gave you a simple answer. It also happens to be a very dumb answer. Well, great, now you know one of the fundamental principles of statistical analysis.”

“That simple answers tend to be bad answers?”

“No,” she said. “That when you’re asking a statistician for help, you need to operationalize your question very carefully, or the statistician is going to give you a sensible answer to a completely different question than the one you actually care about.”

“How come you never ask me about my work,” I asked her one night as we were eating dinner at Chez Margarite. She was devouring lemon-infused pork chops; I was eating a green papaya salad with mint chutney and mango salsa dressing.

“Because I don’t really care about your work,” she said.

“Oh. That’s… kind of blunt.”

“Sorry. I figured I should be honest. That’s what you say you want in a relationship, right? Honesty?”

“Sure,” I said, as the server refilled our water glasses.

“Well,” I offered. “Maybe not that much honesty.”

“Would you like me to feign interest?”

“Maybe just for a bit. That might be nice.”

“Okay,” she sighed, giving me the green light with a hand wave. “Tell me about your work.”

It was a new experience for me; I didn’t want to waste the opportunity, so I tried to choose my words carefully.

“Well, for the last month or so, I’ve been working on re-architecting our site’s database back-end. We’ve never had to worry about scaling before. Our DB can handle a few dozen queries per second, even with some pretty complicated joins. But then someone posts a product page to reddit because of a funny typo, and suddenly we’re getting hundreds of requests a second, and all hell breaks loose.”

I went on to tell her about normal forms and multivalued dependencies and different ways of modeling inheritance in databases. She listened along, nodding intermittently and at roughly appropriate intervals. But I could tell her heart wasn’t in it. She kept looking over with curiosity at the group of middle-aged Japanese businessmen seated at the next table over from us. Or out the window at the homeless man trying to sell rhododendrons to passers-by. Really, she looked everywhere but at me. Finally, I gave up.

“Look,” I said, “I know you’re not into this. I guess I don’t really need to tell you about what I do. Do you want to tell me more about the Weeble distribution?”

Her face lit up with excitement; for a moment, she looked like the moon. A cold, heartless, beautiful moon, full of numbers and error bars and mascara.

“Weibull,” she said.

“Fine,” I said. “You tell me about the Weibull distribution, and I’ll feign interest. Then we’ll have crème brûlée for dessert, and then I’ll buy you a rhododendron from that guy out there on the way out.”

“Rhododendrons,” she snorted. “What a ridiculous choice of flower.”

“How long do you think this relationship is going to last,” I asked her one brisk evening as we stood outside Gordon’s Gourmets with oversized hot dogs in hand.

I was fully aware our relationship was a transient thing—like two people hanging out on a ferry for a couple of hours, both perfectly willing to have a reasonably good time together until the boat hits the far side of the lake, but neither having any real interest in trading numbers or full names.

I was in it for—let’s be honest—the sex and the conversation. As for her, I’m not really sure what she got out of it; I’m not very good at either of those things. I suppose she probably had a hard time finding anyone willing to tolerate her for more than a couple of days.

“About another month,” she said. “We should take a trip to Europe and break up there. That way it won’t be messy when we come back. You book your plane ticket, I’ll book mine. We’ll go together, but come back separately. I’ve always wanted to end a relationship that way—in a planned fashion where there are no weird expectations and no hurt feelings.”

“You think planning to break up in Europe a month from now is a good way to avoid hurt feelings?”


“Okay, I guess I can see that.”

And that’s pretty much how it went. About a month later, we were sitting in a graveyard in a small village in southern France, winding our relationship down. Wine was involved, and had been involved for most of the day; we were both quite drunk.

We’d gone to see this documentary film about homeless magicians who made their living doing card tricks for tourists on the beaches of the French Riviera, and then we stumbled around town until we came across the graveyard, and then, having had a lot of wine, we decided, why not sit on the graves and talk. And so we sat on graves and talked for a while until we finally ran out of steam and affection for each other.

“How do you want to end it,” I asked her when we were completely out of meaningful words, which took less time than you might imagine.

“You sound so sinister,” she said. “Like we’re talking about a suicide pact. When really we’re just two people sitting on graves in a quiet cemetery in France, about to break up forever.”

“Yeah, that. How do you want to end it.”

“Well, I like endings like in Sex, Lies, and Videotape, you know? Endings that don’t really mean anything.”

“You like endings that don’t mean anything.”

“They don’t have to literally mean nothing. I just mean they don’t have to have any deep meaning. I don’t like movies that end on some fake bullshit dramatic note just to further the plot line or provide a sense of closure. I like the ending of Sex, Lies, and Videotape because it doesn’t follow from anything; it just happens.”

“Remind me how it ends?”

“They’re sitting on the steps outside, and Ann, Andie MacDowell’s character, says, ‘I think it’s going to rain.’ Then Graham says, ‘It is raining.’ And that’s it. Fade to black.”

“So that’s what you like.”


“And you want to end our relationship like that.”


“Okay,” I said. “I guess I can do that.”

I looked around. It was almost dark, and the bottle of wine was empty. Well, why not.

“I think it’s going to rain,” I said.

“Jesus,” she said incredulously, leaning back against a headstone belonging to some guy named Jean-Francois. “I meant we should end it like that. That kind of thing. Not that actual thing. What are you, some kind of moron?”

“Oh. Okay. And yes.”

I thought about it for a while.

“I think I got this,” I finally said.

“Ok, go,” she smiled. One of the last—and only—times I saw her smile. It was devastating.

“Okay. I’m going to say: I have some unfinished business to attend to at home. I should really get back to my life. And then you should say something equally tangential and vacuous. Something like: ‘yes, you really should get back there. Your life must be lonely without you.’”

“Your life must be lonely without you…” she tried the words out.

“That’s perfect,” she smiled. “That’s exactly what I wanted.”

Internal consistency is overrated, or How I learned to stop worrying and love shorter measures, Part I

[This is the first of a two-part series motivating and introducing precis, a Python package for automated abbreviation of psychometric measures. In part I, I motivate the search for shorter measures by arguing that internal consistency is highly overrated. In part II, I describe some software that makes it relatively easy to act on this newly-acquired disregard by gleefully sacrificing internal consistency at the altar of automated abbreviation. If you’re interested in this general topic but would prefer a slightly less ridiculous, more academic treatment, read this paper with Hedwig Eisenbarth and Scott Lilienfeld, or take a look at the demo IPython notebook.]

Developing a new questionnaire measure is a tricky business. There are multiple objectives one needs to satisfy simultaneously. Two important ones are:

  • The measure should be reliable. Validity is bounded by reliability; a highly unreliable measure cannot support valid inferences, and is largely useless as a research instrument.
  • The measure should be as short as is practically possible. Time is money, and nobody wants to sit around filling out a 300-item measure if a 60-item version will do.

Unfortunately, these two objectives are in tension with one another to some degree. Random error averages out as one adds more measurements, so in practice, one of the easiest ways to increase the reliability of a measure is to simply add more items. From a reliability standpoint, it’s often better to have many shitty indicators of a latent construct than a few moderately reliable ones*. For example, Cronbach’s alpha–an index of the internal consistency of a measure–is higher for a 20-item measure with a mean inter-item correlation of 0.1 than for a 5-item measure with a mean inter-item correlation of 0.3.
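To check that arithmetic, here is a minimal sketch using the standardized form of Cronbach’s alpha, alpha = k * r / (1 + (k - 1) * r), where k is the number of items and r is the mean inter-item correlation. This form assumes the items have equal variances, and the function name is my own invention for illustration.

```python
def standardized_alpha(n_items, mean_r):
    """Standardized Cronbach's alpha from the item count and the
    mean inter-item correlation (assumes equal item variances)."""
    return (n_items * mean_r) / (1 + (n_items - 1) * mean_r)


# Many weak indicators vs. a few moderately good ones.
long_weak = standardized_alpha(20, 0.1)
short_strong = standardized_alpha(5, 0.3)

print(f"20 items, mean r = 0.1: alpha = {long_weak:.3f}")
print(f" 5 items, mean r = 0.3: alpha = {short_strong:.3f}")
```

The two values come out at roughly 0.69 and 0.68, so the longer measure wins, if only barely, despite having much weaker items.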

Because it’s so easy to increase reliability just by adding items, reporting a certain level of internal consistency is now practically a requirement in order for a measure to be taken seriously. There’s a reasonably widespread view that an adequate level of reliability is somewhere around .8, and that anything below around .6 is just unacceptable. Perhaps as a consequence of this convention, researchers developing new questionnaires will typically include as many items as it takes to hit a “good” level of internal consistency. In practice, relatively few measures use fewer than 8 to 10 items to score each scale (though there are certainly exceptions, e.g., the Ten Item Personality Inventory). Not surprisingly, one practical implication of this policy is that researchers are usually unable to administer more than a handful of questionnaires to participants, because nobody has time to sit around filling out a dozen 100+ item questionnaires.

While understandable from one perspective, the insistence on attaining a certain level of internal consistency is also problematic. It’s easy to forget that while reliability may be necessary for validity, high internal consistency is not. One can have an extremely reliable measure that possesses little or no internal consistency. This is trivial to demonstrate by way of thought experiment. As I wrote in this post a few years ago:

Suppose you have two completely uncorrelated items, and you decide to administer them together as a single scale by simply summing up their scores. For example, let’s say you have an item assessing shoelace-tying ability, and another assessing how well people like the color blue, and you decide to create a shoelace-tying-and-blue-preferring measure. Now, this measure is clearly nonsensical, in that it’s unlikely to predict anything you’d ever care about. More important for our purposes, its internal consistency would be zero, because its items are (by hypothesis) uncorrelated, so it’s not measuring anything coherent. But that doesn’t mean the measure is unreliable! So long as the constituent items are each individually measured reliably, the true reliability of the total score could potentially be quite high, and even perfect. In other words, if I can measure your shoelace-tying ability and your blueness-liking with perfect reliability, then by definition, I can measure any linear combination of those two things with perfect reliability as well. The result wouldn’t mean anything, and the measure would have no validity, but from a reliability standpoint, it’d be impeccable.
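The thought experiment can be made concrete with a toy dataset. The sketch below is only an illustration under strong assumptions: four made-up respondents, two exactly uncorrelated items, and "perfect reliability" modeled as a retest that reproduces every score exactly. All names and numbers are hypothetical.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def cronbach_alpha(items):
    """Cronbach's alpha from a list of per-item score lists."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_var = sum(pvariance(item) for item in items)
    return (k / (k - 1)) * (1 - item_var / pvariance(totals))


# Made-up scores for four people; the two items are exactly uncorrelated.
shoelace = [0, 0, 1, 1]
blue = [0, 1, 0, 1]

# Scale score = simple sum. Perfect measurement means a retest
# yields the identical scores.
time1 = [s + b for s, b in zip(shoelace, blue)]
time2 = list(time1)

alpha = cronbach_alpha([shoelace, blue])
retest_r = pearson(time1, time2)

print(f"inter-item r = {pearson(shoelace, blue):.2f}")
print(f"alpha        = {alpha:.2f}")
print(f"test-retest  = {retest_r:.2f}")
```

Internal consistency is zero while test-retest reliability is perfect (up to floating-point rounding), which is the whole point of the shoelace-and-blue example.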

In fact, we can push this line of thought even further, and say that the perfect measure—in the sense of maximizing both reliability and brevity—should actually have an internal consistency of exactly zero. A value any higher than zero would imply the presence of redundancy between items, which in turn would suggest that we could (at least in theory, though typically not in practice) get rid of one or more items without reducing the amount of variance captured by the measure as a whole.

To use a spatial analogy, suppose we think of each of our measure’s items as a circle in a 2-dimensional space:

[Figure: circles! we haz them.]

Here, our goal is to cover the maximum amount of territory using the smallest number of circles (analogous to capturing as much variance in participant responses as possible using the fewest number of items). By this light, the solution in the above figure is kind of crummy, because it fails to cover much of the space despite having 20 circles to work with. The obvious problem is that there’s a lot of redundancy between the circles—many of them overlap in space. A more sensible arrangement, assuming we insisted on keeping all 20 circles, would look like this:


In this case we get complete coverage of the target space just by realigning the circles to minimize overlap.

Alternatively, we could opt to cover more or less the same territory as the first arrangement, but using many fewer circles (in this case, 10):


It turns out that what goes for our toy example in 2D space also holds for self-report measurement of psychological constructs that exist in much higher dimensions. For example, suppose we’re interested in developing a new measure of Extraversion, broadly construed. We want to make sure our measure covers multiple aspects of Extraversion—including sociability, increased sensitivity to reward, assertiveness, talkativeness, and so on. So we develop a fairly large item pool, and then we iteratively select groups of items that (a) have good face validity as Extraversion measures, (b) predict external criteria we think Extraversion should predict (predictive validity), and (c) tend to correlate with each other modestly-to-moderately. At some point we end up with a measure that satisfies all of these criteria, and then presumably we can publish our measure and go on to achieve great fame and fortune.

So far, so good—we’ve done everything by the book. But notice something peculiar about the way the book would have us do things: the very fact that we strive to maintain reasonably solid correlations between our items actually makes our measurement approach much less efficient. To return to our spatial analogy, it amounts to insisting that our circles have to have a high degree of overlap, so that we know for sure that we’re actually measuring what we think we’re measuring. And to be fair, we do gain something for our trouble, in the sense that we can look at our little plot above and say, a-yup, we’re definitely covering that part of the space. But we also lose something, in that we waste a lot of items (or circles) trying to cover parts of the space that have already been covered by other items.

Why would we do something so inefficient? Well, the problem is that in the real world—unlike in our simple little 2D world—we don’t usually know ahead of time exactly what territory we need to cover. We probably have a fuzzy idea of our Extraversion construct, and we might have a general sense that, you know, we should include both reward-related and sociability-related items. But it’s not as if there’s a definitive and unambiguous answer to the question “what behaviors are part of the Extraversion construct?”. There’s a good deal of variation in human behavior that could in principle be construed as part of the latent Extraversion construct, but that in practice is likely to be overlooked (or deliberately omitted) by any particular measure of Extraversion. So we have to carefully explore the space. And one reasonable way to determine whether any given item within that space is still measuring Extraversion is to inspect its correlations with other items that we consider to be unambiguous Extraversion items. If an item correlates, say, 0.5 with items like “I love big parties” and “I constantly seek out social interactions”, there’s a reasonable case to be made that it measures at least some aspects of Extraversion. So we might decide to keep it in our measure. Conversely, if an item shows very low correlations with other putative Extraversion items, we might incline to throw it out.

Now, there’s nothing intrinsically wrong with this strategy. But what’s important to realize is that, once we’ve settled on a measure we’re happy with, there’s no longer a good reason to keep all of that redundancy hanging around. It may be useful when we first explore the territory, but as soon as we yell out FIN! and put down our protractors and levels (or whatever it is the kids are using to create new measures these days), it’s now just costing us time and money by making data collection less efficient. We would be better off saying something like, hey, now that we know what we’re trying to measure, let’s see if we can measure it equally well with fewer items. And at that point, we’re in the land of criterion-based measure development, where the primary goal is to predict some target criterion as accurately as possible, foggy notions of internal consistency be damned.

Unfortunately, committing ourselves fully to the noble and just cause of more efficient measurement still leaves open the question of just how we should go about eliminating items from our overly long measures. For that, you’ll have to stay tuned for Part II, wherein I use many flowery words and some concise Python code to try to convince you that this piece of software provides one reasonable way to go about it.

* On a tangential note, this is why traditional pre-publication peer review isn’t very effective, and is in dire need of replacement. Meta-analytic estimates put the inter-reviewer reliability across fields at around .2 to .3, and it’s rare to have more than two or three reviewers on a paper. No psychometrician would recommend evaluating people’s performance in high-stakes situations with just two items that have a ~.3 correlation, yet that’s how we evaluate nearly all of the scientific literature!

yet another Python state machine (and why you might care)

TL;DR: I wrote a minimalistic state machine implementation in Python. You can find the code on GitHub. The rest of this post explains what a state machine is and why you might (or might not) care. The post is slanted towards scientists who are technically inclined but lack formal training in computer science or software development. If you just want some documentation or examples, see the README.

A common problem that arises in many software applications is the need to manage an application’s trajectory through a set of discrete states. This problem will be familiar, for instance, to almost every researcher who has ever had to program an experiment for a study involving human subjects: there are typically a number of different states your study can be in (informed consent, demographic information, stimulus presentation, response collection, etc.), and these states are governed by a set of rules that determine the valid progression of your participants from one state to another. For example, a participant can proceed from informed consent to a cognitive task, but never the reverse (on pain of entering IRB hell!).

In the best possible case, the transition rules are straightforward. For example, given states [A, B, C, D], life would be simple if the only valid transitions were A –> B, B –> C, and C –> D. Unfortunately, the real world is more complicated, and state transitions are rarely completely sequential. More commonly, at least some states have multiple potential destinations. Sometimes the identity of the next state depends on meeting certain conditions while in the current state (e.g., if the subject responded incorrectly, the study may transition to a different state than if they had responded correctly); other times the rules may be probabilistic, or depend on the recent trajectory through state space (e.g., a slot machine transitions to a winning or losing state with some fixed probability that may also depend on its current position, recent history, etc.).

In software development, a standard method for dealing with this kind of problem is to use something called a finite-state machine (FSM). FSMs have been around a relatively long time (at least since Mealy and Moore’s work in the 1950s), and have all kinds of useful applications. In a nutshell, what a good state machine implementation does is represent much of the messy logic governing state transitions in a more abstract, formal and clean way. Rather than having to write a lot of complicated nested logic to direct the flow of the application through state space, one can usually get away with a terse description of (a) the possible states of the machine and (b) a list of possible transitions, including a specification of the source and destination states for each transition, what conditions must be met in order for the transition to execute, etc.

For example, suppose you need to write some code to transition between different phases in an online experiment. Your naive implementation might look vaguely like this (leaving out a lot of supporting code and focusing just on the core logic):
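The snippet itself is not reproduced in this copy of the post. As a stand-in, here is a minimal sketch of the kind of conditional chain being described. It is not the original code: the class, its attributes, and the phase names are all hypothetical, and a plain list stands in for a database.

```python
class Study:
    """Hypothetical online study that tracks its phase with a chain of
    conditionals (the 'naive' approach)."""

    def __init__(self):
        self.state = "consent"
        self.consent_signed = False
        self.demographics = {}
        self.saved = []  # stands in for a database

    def save(self, label, data):
        self.saved.append((label, dict(data)))

    def advance(self):
        """Move to the next phase if the current phase's rules allow it."""
        if self.state == "consent":
            # Don't leave informed consent until the form is signed.
            if self.consent_signed:
                self.state = "demographics"
        elif self.state == "demographics":
            # Save the survey responses right before moving on.
            if self.demographics.get("age") is not None:
                self.save("demographics", self.demographics)
                self.state = "personality"
        elif self.state == "personality":
            self.state = "task"
        elif self.state == "task":
            self.state = "done"
        return self.state


study = Study()
study.advance()               # consent not signed yet, so nothing happens
study.consent_signed = True
study.advance()               # now moves on to the demographic survey
print(study.state)
```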

This is a minimalistic example, but already, it illustrates several common scenarios–e.g., that the transition from one state to another often depends on meeting some specified condition (we don’t advance beyond the informed consent stage until the user signs the document), and that there may be some actions we want to issue immediately before or after a particular kind of transition (e.g., we save survey responses before we move onto the next phase).

The above code is still quite manageable, so if things never get any more complex than this, there may be no reason to abandon a (potentially lengthy) chain of conditionals in favor of a fundamentally different approach. But trouble tends to arise when the complexity does increase–e.g., you need to throw a few more states into the mix later on–or when you need to move stuff around (e.g., you decide to administer the task before the demographic survey). If you’ve ever had the frustrating experience of tracing the flow of your app through convoluted logic scattered across several files, and being unable to figure out why your code is entering the wrong state in response to some triggered event, the state machine pattern may be right for you.

I’ve made extensive use of state machines in the past when building online studies, and finding a suitable implementation has never been a problem. For example, in Rails–which is what most of my apps have been built in–there are a number of excellent options, including the state_machine plugin and (more recently) Statesman. In the last year or two, though, I’ve begun to transition all of my web development to Python (if you want to know why, read this). Python is a very common language, and the basic FSM pattern is very simple, so there are dozens of Python FSM implementations out there. But for some reason, very few of the Python implementations are as elegant and usable as their Ruby analogs. This isn’t to say there aren’t some nice ones (I’m partial to Fysom, for instance)–just that none of them quite meet my needs (in particular, there are very few fully object-oriented implementations, and I like to have my state machine tightly coupled with the model it’s managing). So I decided to write one. It’s called Transitions, and you can find the code on GitHub, or install it directly from the command prompt (“pip install transitions”, assuming you have pip installed). It’s very lightweight–fewer than 200 lines of code (the documentation is about 10 times as long!)–but still turns out to be quite functional.

For example, here’s some code that does almost exactly the same thing as what we saw above (there are much more extensive examples and documentation in the GitHub README):
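The snippet is likewise missing from this copy. Rather than guess at the library's exact API, the sketch below uses a tiny standard-library stand-in to show the same declarative idea: a list of states, a list of transitions with conditions and callbacks, and a model that carries the state. TinyMachine and everything in it is hypothetical and is not the Transitions API; for that, see the GitHub README. In particular, this stand-in is driven with machine.trigger("advance") instead of a method on the model.

```python
class TinyMachine:
    """Hypothetical table-driven state machine (NOT the Transitions API)."""

    def __init__(self, model, states, transitions, initial):
        self.model = model
        self.states = list(states)
        self.transitions = list(transitions)
        model.state = initial

    def trigger(self, name):
        """Run the first matching transition; return True if the state changed."""
        for t in self.transitions:
            if t["trigger"] != name or t["source"] != self.model.state:
                continue
            # Each named condition is a model method that must return True.
            if not all(getattr(self.model, c)() for c in t.get("conditions", [])):
                continue
            # 'before' callbacks run right before the state changes.
            for callback in t.get("before", []):
                getattr(self.model, callback)()
            self.model.state = t["dest"]
            return True
        return False


class Study:
    """Hypothetical model; it holds data and callbacks, not flow logic."""

    def __init__(self):
        self.consent_signed = False
        self.demographics = {}
        self.saved = []  # stands in for a database

    def validate_consent(self):
        return self.consent_signed

    def validate_demographics(self):
        return self.demographics.get("age") is not None

    def save_demographics(self):
        self.saved.append(dict(self.demographics))


states = ["consent", "demographics", "personality", "task"]
transitions = [
    {"trigger": "advance", "source": "consent", "dest": "demographics",
     "conditions": ["validate_consent"]},
    {"trigger": "advance", "source": "demographics", "dest": "personality",
     "conditions": ["validate_demographics"], "before": ["save_demographics"]},
    {"trigger": "advance", "source": "personality", "dest": "task"},
]

study = Study()
machine = TinyMachine(study, states, transitions, initial="consent")
```

All of the branching lives inside the small machine class; the study-specific part is just the two lists, which is what makes reordering or adding phases cheap.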

That’s it! And now we have a nice object-oriented state machine that elegantly transitions between phases of the experiment, triggers callback functions as needed, and supports conditional transitions, branching, and various other nice features, all without ever having to write a single explicit conditional or for-loop. Understanding what’s going on is as simple as looking at the specification of the states and transitions. For example, we can tell at a glance from the second transition that if the model is currently in the ‘demographics’ state, calling advance() will effect a transition to the ‘personality’ state–conditional on the validate_demographics() function returning True. Also, right before the transition executes, the save_demographics() callback will be called.

As I noted above, given the simplicity of the example, this may not seem like a huge win. If anything, the second snippet is slightly longer than the first. But it’s also much clearer (once you’re familiar with the semantics of Transitions), scales much better as complexity increases, and will be vastly easier to modify when you need to change anything.

Anyway, I mention all of this here for two reasons. First, as small and simple a project as this is, I think it ended up being one of the more elegant and functional minimalistic Python FSMs–so I imagine a few other people might find it useful (yes, I’m basically just exploiting my PageRank on Google to drive traffic to GitHub). And second, I know many people who read this blog are researchers who regularly program experiments, but probably haven’t encountered state machines before. So, Python implementation aside, the general idea that there’s a better way to manage complex state transitions than writing a lot of ugly logic seems worth spreading.

In defense of In Defense of Facebook

A long, long time ago (in social media terms), I wrote a post defending Facebook against accusations of ethical misconduct related to a newly-published study in PNAS. I won’t rehash the study, or the accusations, or my comments in any detail here; for that, you can read the original post (I also recommend reading this or this for added context). While I stand by most of what I wrote, as is the nature of things, sometimes new information comes to light, and sometimes people say things that make me change my mind. So I thought I’d post my updated thoughts and reactions. I also left some additional thoughts in a comment on my last post, which I won’t rehash here.

Anyway, in no particular order…

I’m not arguing for a lawless world where companies can do as they like with your data

Some people apparently interpreted my last post as a defense of Facebook’s data use policy in general. It wasn’t. I probably brought this on myself in part by titling the post “In Defense of Facebook”. Maybe I should have called it something like “In Defense of this one particular study done by one Facebook employee”. In any case, I’ll reiterate: I’m categorically not saying that Facebook–or any other company, for that matter–should be allowed to do whatever it likes with its users’ data. There are plenty of valid concerns one could raise about the way companies like Facebook store, manage, and use their users’ data. And for what it’s worth, I’m generally in favor of passing new rules regulating the use of personal data in the private sector. So, contrary to what some posts suggested, I was categorically not advocating for a laissez-faire world in which large corporations get to do as they please with your information, and there’s nothing us little people can do about it.

The point I made in my last post was much narrower than that–namely, that picking on the PNAS study as an example of ethically questionable practices at Facebook was a bad idea, because (a) there aren’t any new risks introduced by this manipulation that aren’t already dwarfed by the risks associated with using Facebook itself (which is not exactly a high-risk enterprise to begin with), and (b) there are literally thousands of experiments just like this being conducted every day by large companies intent on figuring out how best to market their products and services–so Facebook’s study doesn’t stand out in any respect. My point was not that you shouldn’t be concerned about who has your data and how they’re using it, but that it’s deeply counterproductive to go after Facebook for this particular experiment when Facebook is one of the few companies in this arena who actually (occasionally) publish the results of their findings in the scientific literature, instead of hiding them entirely from the light, as almost everyone else does. Of course, that will probably change as a result of this controversy.

I Was Wrong–A/B Testing Edition.

One claim I made in my last post that was very clearly wrong is this (emphasis added):

What makes the backlash on this issue particularly strange is that I’m pretty sure most people do actually realize that their experience on Facebook (and on other websites, and on TV, and in restaurants, and in museums, and pretty much everywhere else) is constantly being manipulated. I expect that most of the people who’ve been complaining about the Facebook study on Twitter are perfectly well aware that Facebook constantly alters its user experience–I mean, they even see it happen in a noticeable way once in a while, whenever Facebook introduces a new interface.

After watching the commentary over the past two days, I think it’s pretty clear I was wrong about this. A surprisingly large number of people clearly were genuinely unaware that Facebook, Twitter, Google, and other major players in every major industry (not just tech–also banks, groceries, department stores, you name it) are constantly running large-scale, controlled experiments on their users and customers. For instance, here’s a telling comment left on my last post:

The main issue I have with the experiment is that they conducted it without telling us. Given, that would have been counterproductive, but even a small adverse affect is still an adverse affect. I just don’t like the idea that corporations can do stuff to me without my consent. Just my opinion.

Similar sentiments are all over the place. Clearly, the revelation that Facebook regularly experiments on its users without their knowledge was indeed just that to many people–a revelation. I suppose in this sense, there’s potentially a considerable upside to this controversy, inasmuch as it has clearly served to raise awareness of industry-standard practices.

Questions about the ethics of the PNAS paper’s publication

My post focused largely on the question of whether the experiment Facebook conducted was itself illegal or unethical. I took this to be the primary concern of most lay people who have expressed concern about the episode. As I discussed in my post, I think it’s quite clear that (a) the experiment itself is entirely legal and (b) any ethical objections one could raise are actually much broader objections about the way we regulate data use and consumer privacy, and have nothing to do with Facebook in particular. However, there’s a separate question that does specifically concern Facebook–or really, the authors of the PNAS paper–which is whether the authors, in their efforts to publish their findings, violated any laws or regulations.

When I wrote my post, I was under the impression–based largely on reports of an interview with the PNAS editor, Susan Fiske–that the authors had in fact obtained approval to conduct the study from an IRB, and had simply neglected to include that information in the text (which would have been an editorial lapse, but not an unethical act). I wrote as much in a comment on my post. I was not suggesting–as some seemed to take away–that Facebook doesn’t need to get IRB approval. I was operating on the assumption that it had obtained IRB approval, based on the information available at the time.

In any case, it now appears that may not be exactly what happened. Unfortunately, it’s not yet clear exactly what did happen. One version of events people have suggested is that the study’s authors exploited a loophole in the rules by having Facebook conduct and analyze the experiment without the involvement of the other authors–who only contributed to the genesis of the idea and the writing of the manuscript. However, this interpretation is not unambiguous, and risks maligning the authors’ reputations unfairly, because Adam Kramer’s post explaining the motivation for the experiment suggests that the idea for the experiment originated entirely at Facebook, and was related to internal needs:

The reason we did this research is because we care about the emotional impact of Facebook and the people that use our product. We felt that it was important to investigate the common worry that seeing friends post positive content leads to people feeling negative or left out. At the same time, we were concerned that exposure to friends’ negativity might lead people to avoid visiting Facebook. We didn’t clearly state our motivations in the paper.

How you interpret the ethics of the study thus depends largely on what you believe actually happened. If you believe that the genesis and design of the experiment were driven by Facebook’s internal decision-making, and the decision to publish an interesting finding came only later, then there’s nothing at all ethically questionable about the authors’ behavior. It would have made no more sense to seek out IRB approval for this one experiment than for any of the other in-house experiments Facebook regularly conducts. And there is, again, no question whatsoever that Facebook does not have to get approval from anyone to do experiments that are not for the purpose of systematic, generalizable research.

Moreover, since the non-Facebook authors did in fact ask the IRB to review their proposal to use archival data–and the IRB exempted them from review, as is routinely done for this kind of analysis–there would be no legitimacy to the claim that the authors acted unethically. About the only claim one could raise an eyebrow at is that the authors “didn’t clearly state” their motivations. But since presenting a post-hoc justification for one’s studies that has nothing to do with the original intention is extremely common in psychology (though it shouldn’t be), it’s not really fair to fault Kramer et al for doing something that is standard practice.

If, on the other hand, the idea for the study did originate outside of Facebook, and the authors deliberately attempted to avoid prospective IRB review, then I think it’s fair to say that their behavior was unethical. However, given that the authors were following the letter of the law (if clearly not the spirit), it’s not clear that PNAS should have, or could have, rejected the paper. It certainly should have demanded that information regarding interactions with the IRB be included in the manuscript, and perhaps it could have published some kind of expression of concern alongside the paper. But I agree with Michelle Meyer’s analysis that, in taking the steps they took, the authors are almost certainly operating within the rules, because (a) Facebook itself is not subject to HHS rules, (b) the non-Facebook authors were not technically “engaged in research”, and (c) the archival use of already-collected data by the non-Facebook authors was approved by the Cornell IRB (or rather, the study was exempted from further review).

Absent clear evidence of what exactly happened in the lead-up to publication, I think the appropriate course of action is to withhold judgment. In the interim, what the episode clearly does do is lay bare how ill-prepared the existing HHS regulations are for dealing with the research use of data collected online–particularly when the data was acquired by private entities. Actually, it’s not just research use that’s problematic; it’s clear that many people complaining about Facebook’s conduct this week don’t really give a hoot about the “generalizable knowledge” side of things, and are fundamentally just upset that Facebook is allowed to run these kinds of experiments at all without providing any notification.

In my view, what’s desperately called for is a new set of regulations that provide a unitary code for dealing with consumer data across the board–i.e., in both research and non-research contexts. This leaves aside exactly what such regulations would look like, of course. My personal view is that the right direction to move in is to tighten consumer protection laws to better regulate management and use of private citizens’ data, while simultaneously liberalizing the research use of private datasets that have already been acquired. For example, I would favor a law that (a) forced Facebook and other companies to more clearly and explicitly state how they use their users’ data, (b) provided opt-out options when possible, along with the ability for users to obtain a report of how their data has been used in the past, and (c) gave blanket approval to use data acquired under these conditions for any and all academic research purposes so long as the data are deidentified. Many people will disagree with this, of course, and have very different ideas. That’s fine; the key point is that the conversation we should be having is about how to update and revise the rules governing research vs. non-research uses of data in such a way that situations like the PNAS study don’t come up again.

What Facebook does is not research–until they try to publish it

Much of the outrage over the Facebook experiment is centered around the perception that Facebook shouldn’t be allowed to conduct research on its users without their consent. What many people mean by this, I think, is that Facebook shouldn’t be allowed to conduct any experiments on its users for purposes of learning things about user experience and behavior unless Facebook explicitly asks for permission. A point that I should have clarified in my original post is that Facebook users are, in the normal course of things, not considered participants in a research study, no matter how or how much their emotions are manipulated. That’s because the HHS’s definition of research includes, as a necessary component, that there be an active intention to contribute to generalizable new knowledge.

Now, to my mind, this isn’t a great way to define “research”–I think it’s a good idea to avoid definitions that depend on knowing what people’s intentions were when they did something. But that’s the definition we’re stuck with, and there’s really no ambiguity over whether Facebook’s normal operations–which include constant randomized, controlled experimentation on its users–constitute research in this sense. They clearly don’t. Put simply, if Facebook were to eschew disseminating its results to the broader community, the experiment in question would not have been subject to any HHS regulations whatsoever (though, as Michelle Meyer astutely pointed out, technically the experiment probably isn’t subject to HHS regulation even now, so the point is moot). Again, to reiterate: it’s only the fact that Kramer et al wanted to publish their results in a scientific journal that opened them up to criticism of research misconduct in the first place.

This observation may not have any impact on your view if your concern is fundamentally about the publication process–i.e., you don’t object to Facebook doing the experiment; what you object to is Facebook trying to disseminate their findings as research. But it should have a strong impact on your views if you were previously under the impression that Facebook’s actions must have violated some existing human subjects regulation or consumer protection law. The laws in the United States–at least as I understand them, and I admittedly am not a lawyer–currently afford you no such protection.

Now, is it a good idea to have two very separate standards, one for research and one for everything else? Probably not. Should Facebook be allowed to do whatever it wants to your user experience so long as it’s covered under the Data Use policy in the user agreement you didn’t read? Probably not. But what’s unequivocally true is that, as it stands right now, your interactions with Facebook–no matter how your user experience, data, or emotions are manipulated–are not considered research unless Facebook manipulates your experience with the express intent of disseminating new knowledge to the world.

Informed consent is not mandatory for research studies

As a last point, there seems to be a very common misconception floating around among commentators that the Facebook experiment was unethical because it didn’t provide informed consent, which is a requirement for all research studies involving experimental manipulation. I addressed this in the comments on my last post in response to other comments:

[I]t’s simply not correct to suggest that all human subjects research requires informed consent. At least in the US (where Facebook is based), the rules governing research explicitly provide for a waiver of informed consent. Directly from the HHS website:

An IRB may approve a consent procedure which does not include, or which alters, some or all of the elements of informed consent set forth in this section, or waive the requirements to obtain informed consent provided the IRB finds and documents that:

(1) The research involves no more than minimal risk to the subjects;

(2) The waiver or alteration will not adversely affect the rights and welfare of the subjects;

(3) The research could not practicably be carried out without the waiver or alteration; and

(4) Whenever appropriate, the subjects will be provided with additional pertinent information after participation.

Granting such waivers is a commonplace occurrence; I myself have had online studies granted waivers before for precisely these reasons. In this particular context, it’s very clear that conditions (1) and (2) are met (because this easily passes the “not different from ordinary experience” test). Further, Facebook can also clearly argue that (3) is met, because explicitly asking for informed consent is likely not viable given internal policy, and would in any case render the experimental manipulation highly suspect (because it would no longer be random). The only point one could conceivably raise questions about is (4), but here again I think there’s a very strong case to be made that Facebook is not about to start providing debriefing information to users every time it changes some aspect of the news feed in pursuit of research, considering that its users have already agreed to its User Agreement, which authorizes this and much more.

Now, if you disagree with the above analysis, that’s fine, but what should be clear enough is that there are many IRBs (and I’ve personally interacted with some of them) that would have authorized a waiver of consent in this particular case without blinking. So this is clearly well within “reasonable people can disagree” territory, rather than “oh my god, this is clearly illegal and unethical!” territory.

I can understand the objection that Facebook should have applied for IRB approval prior to conducting the experiment (though, as I note above, that’s only true if the experiment was initially conducted as research, which is not clear right now). However, it’s important to note that there is no guarantee that an IRB would have insisted on informed consent at all in this case. There’s considerable heterogeneity in different IRBs’ interpretation of the HHS guidelines (and in fact, even across different reviewers within the same IRB), and I don’t doubt that many IRBs would have allowed Facebook’s application to sail through without any problems (see, e.g., this comment on my last post)–though I think there’s a general consensus that a debriefing of some kind would almost certainly be requested.

In defense of Facebook

[UPDATE July 1st: I’ve now posted some additional thoughts in a second post here.]

It feels a bit strange to write this post’s title, because I don’t find myself defending Facebook very often. But there seems to be some discontent in the socialmediaverse at the moment over a new study in which Facebook data scientists conducted a large-scale–over half a million participants!–experimental manipulation on Facebook in order to show that emotional contagion occurs on social networks. The news that Facebook has been actively manipulating its users’ emotions has, apparently, enraged a lot of people.

The study

Before getting into the sources of that rage–and why I think it’s misplaced–though, it’s worth describing the study and its results. Here’s a description of the basic procedure, from the paper:

The experiment manipulated the extent to which people (N = 689,003) were exposed to emotional expressions in their News Feed. This tested whether exposure to emotions led people to change their own posting behaviors, in particular whether exposure to emotional content led people to post content that was consistent with the exposure—thereby testing whether exposure to verbal affective expressions leads to similar verbal expressions, a form of emotional contagion. People who viewed Facebook in English were qualified for selection into the experiment. Two parallel experiments were conducted for positive and negative emotion: One in which exposure to friends’ positive emotional content in their News Feed was reduced, and one in which exposure to negative emotional content in their News Feed was reduced. In these conditions, when a person loaded their News Feed, posts that contained emotional content of the relevant emotional valence, each emotional post had between a 10% and 90% chance (based on their User ID) of being omitted from their News Feed for that specific viewing.
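To make the procedure a little more concrete, here’s a minimal sketch of what per-viewing omission along these lines might look like. To be clear, this is my own illustration and not the study’s code: the tiny word lists, the use of a hash to turn a user ID into an omission probability, and all of the function names are assumptions made purely for the example.

```python
import hashlib
import random

# Hypothetical stand-in word lists; the study's actual automated
# word-based detection is not reproduced here.
POSITIVE_WORDS = {"happy", "awesome", "great", "love"}
NEGATIVE_WORDS = {"sad", "angry", "awful", "hate"}


def omission_probability(user_id):
    """Map a user ID to a fixed omission probability between 0.10 and 0.90.

    Assumption: hashing the ID is just one illustrative way to make the
    probability depend on the user ID, as the paper describes.
    """
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    fraction = int.from_bytes(digest[:8], "big") / (2**64 - 1)
    return 0.10 + 0.80 * fraction


def contains_emotion(post, words):
    """Return True if the post contains any word from the target list."""
    tokens = {token.strip(".,!?").lower() for token in post.split()}
    return not tokens.isdisjoint(words)


def filter_feed(posts, user_id, target_words, rng):
    """Return the feed for one viewing, randomly omitting targeted posts."""
    p_omit = omission_probability(user_id)
    kept = []
    for post in posts:
        if contains_emotion(post, target_words) and rng.random() < p_omit:
            continue  # omitted for this viewing only
        kept.append(post)
    return kept


feed = ["I had an awesome day!", "Meeting moved to 3pm.", "So sad today."]
# Negativity-reduced condition: only posts with negative words can be dropped.
print(filter_feed(feed, user_id=12345, target_words=NEGATIVE_WORDS,
                  rng=random.Random(0)))
```

Note that nothing is ever added to the feed in this sketch; posts without a targeted word always pass through untouched, and a targeted post is only dropped for that one viewing.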

And here’s their central finding:

What the figure shows is that, in the experimental conditions, where negative or positive emotional posts are censored, users produce correspondingly more positive or negative emotional words in their own status updates. Reducing the number of negative emotional posts users saw led those users to produce more positive, and fewer negative words (relative to the unmodified control condition); conversely, reducing the number of presented positive posts led users to produce more negative and fewer positive words of their own.

Taken at face value, these results are interesting and informative. For the sake of contextualizing the concerns I discuss below, though, two points are worth noting. First, these effects, while highly statistically significant, are tiny. The largest effect size reported had a Cohen’s d of 0.02–meaning that eliminating a substantial proportion of emotional content from a user’s feed had the monumental effect of shifting that user’s own emotional word use by two hundredths of a standard deviation. In other words, the manipulation had a negligible real-world impact on users’ behavior. To put it in intuitive terms, the effect of condition in the Facebook study is roughly comparable to a hypothetical treatment that increased the average height of the male population in the United States by about one twentieth of an inch (given a standard deviation of ~2.8 inches). Theoretically interesting, perhaps, but not very meaningful in practice.
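For anyone who wants to check the height analogy, the arithmetic is just the definition of Cohen’s d run backwards: d is the difference in means divided by the standard deviation, so the implied raw shift is d times the SD. Here’s a quick sketch using the two numbers quoted above (the function name is mine, and the ~2.8 inch SD is the same rough figure I used in the text):

```python
# Convert a standardized effect size (Cohen's d) into raw units.
# d = (difference in means) / (standard deviation), so difference = d * SD.

COHENS_D = 0.02          # largest effect size reported in the paper
HEIGHT_SD_INCHES = 2.8   # rough SD of US male height used in the analogy


def raw_shift(d, sd):
    """Return the mean shift, in raw units, implied by effect size d."""
    return d * sd


shift = raw_shift(COHENS_D, HEIGHT_SD_INCHES)
print(f"Implied height shift: {shift:.3f} inches "
      f"(about 1/{round(1 / shift)} of an inch)")
```

That works out to a bit under six hundredths of an inch, which is in the same ballpark as the “one twentieth of an inch” figure above.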

Second, the fact that users in the experimental conditions produced content with very slightly more positive or negative emotional content doesn’t mean that those users actually felt any differently. It’s entirely possible–and I would argue, even probable–that much of the effect was driven by changes in the expression of ideas or feelings that were already on users’ minds. For example, suppose I log onto Facebook intending to write a status update to the effect that I had an “awesome day today at the beach with my besties!” Now imagine that, as soon as I log in, I see in my news feed that an acquaintance’s father just passed away. I might very well think twice about posting my own message–not necessarily because the news has made me feel sad myself, but because it surely seems a bit unseemly to celebrate one’s own good fortune around people who are currently grieving. I would argue that such subtle behavioral changes, while certainly responsive to others’ emotions, shouldn’t really be considered genuine cases of emotional contagion. Yet given how small the effects were, one wouldn’t need very many such changes to occur in order to produce the observed results. So, at the very least, the jury should still be out on the extent to which Facebook users actually feel differently as a result of this manipulation.

The concerns

Setting aside the rather modest (though still interesting!) results, let’s turn to look at the criticism. Here’s what Katy Waldman, writing in a Slate piece titled “Facebook’s Unethical Experiment”, had to say:

The researchers, who are affiliated with Facebook, Cornell, and the University of California–San Francisco, tested whether reducing the number of positive messages people saw made those people less likely to post positive content themselves. The same went for negative messages: Would scrubbing posts with sad or angry words from someone’s Facebook feed make that person write fewer gloomy updates?

The upshot? Yes, verily, social networks can propagate positive and negative feelings!

The other upshot: Facebook intentionally made thousands upon thousands of people sad.

Or consider an article in The Wire, quoting Jacob Silverman:

“What’s disturbing about how Facebook went about this, though, is that they essentially manipulated the sentiments of hundreds of thousands of users without asking permission (blame the terms of service agreements we all opt into). This research may tell us something about online behavior, but it’s undoubtedly more useful for, and more revealing of, Facebook’s own practices.”

On Twitter, the reaction to the study has been similarly negative. A lot of people appear to be very upset at the revelation that Facebook would actively manipulate its users’ news feeds in a way that could potentially influence their emotions.

Why the concerns are misplaced

To my mind, the concerns expressed in the Slate piece and elsewhere are misplaced, for several reasons. First, they largely mischaracterize the study’s experimental procedures–to the point that I suspect most of the critics haven’t actually bothered to read the paper. In particular, the suggestion that Facebook “manipulated users’ emotions” is quite misleading. Framing it that way tacitly implies that Facebook must have done something specifically designed to induce a different emotional experience in its users. In reality, for users assigned to the experimental condition, Facebook simply removed a variable proportion of status messages that were automatically detected as containing positive or negative emotional words. Let me repeat that: Facebook removed emotional messages for some users. It did not, as many people seem to be assuming, add content specifically intended to induce specific emotions. Now, given that a large amount of content on Facebook is already highly emotional in nature–think about all the people sharing their news of births, deaths, break-ups, etc.–it seems very hard to argue that Facebook would have been introducing new risks to its users even if it had presented some of them with more emotional content. But it’s certainly not credible to suggest that replacing 10% – 90% of emotional content with neutral content constitutes a potentially dangerous manipulation of people’s subjective experience.

Second, it’s not clear what the notion that Facebook users’ experience is being “manipulated” really even means, because the Facebook news feed is, and has always been, a completely contrived environment. I hope that people who are concerned about Facebook “manipulating” user experience in support of research realize that Facebook is constantly manipulating its users’ experience. In fact, by definition, every single change Facebook makes to the site alters the user experience, since there simply isn’t any experience to be had on Facebook that isn’t entirely constructed by Facebook. When you log onto Facebook, you’re not seeing a comprehensive list of everything your friends are doing, nor are you seeing a completely random subset of events. In the former case, you would be overwhelmed with information, and in the latter case, you’d get bored of Facebook very quickly. Instead, what you’re presented with is a carefully curated experience that is, from the outset, crafted in such a way as to create a more engaging experience (read: keeps you spending more time on the site, and coming back more often). The items you get to see are determined by a complex and ever-changing algorithm that you make only a partial contribution to (by indicating what you like, what you want hidden, etc.). It has always been this way, and it’s not clear that it could be any other way. So I don’t really understand what people mean when they sarcastically suggest–as Katy Waldman does in her Slate piece–that “Facebook reserves the right to seriously bum you out by cutting all that is positive and beautiful from your news feed”. Where does Waldman think all that positive and beautiful stuff comes from in the first place? Does she think it spontaneously grows wild in her news feed, free from the meddling and unnatural influence of Facebook engineers?

Third, if you were to construct a scale of possible motives for manipulating users’ behavior–with the global betterment of society at one end, and something really bad at the other end–I submit that conducting basic scientific research would almost certainly be much closer to the former end than would the other standard motives we find on the web–like trying to get people to click on more ads. The reality is that Facebook–and virtually every other large company with a major web presence–is constantly conducting large controlled experiments on user behavior. Data scientists and user experience researchers at Facebook, Twitter, Google, etc. routinely run dozens, hundreds, or thousands of experiments a day, all of which involve random assignment of users to different conditions. Typically, these manipulations aren’t conducted in order to test basic questions about emotional contagion; they’re conducted with the explicit goal of helping to increase revenue. In other words, if the idea that Facebook would actively try to manipulate your behavior bothers you, you should probably stop reading this right now and go close your account. You also should definitely not read this paper suggesting that a single social message on Facebook prior to the last US presidential election may have single-handedly increased national voter turn-out by as much as 0.6%. Oh, and you should probably also stop using Google, YouTube, Yahoo, Twitter, Amazon, and pretty much every other major website–because I can assure you that, in every single case, there are people out there who get paid a good salary to… yes, manipulate your emotions and behavior! For better or worse, this is the world we live in. If you don’t like it, you can abandon the internet, or at the very least close all of your social media accounts. But the suggestion that Facebook is doing something unethical simply by publishing the results of one particular experiment among thousands–and in this case, an experiment featuring a completely innocuous design that, if anything, is probably less driven by a profit motive than most of what Facebook does–seems kind of absurd.

Fourth, it’s worth keeping in mind that there’s nothing intrinsically evil about the idea that large corporations might be trying to manipulate your experience and behavior. Everybody you interact with–including every one of your friends, family, and colleagues–is constantly trying to manipulate your behavior in various ways. Your mother wants you to eat more broccoli; your friends want you to come get smashed with them at a bar; your boss wants you to stay at work longer and take fewer breaks. We are always trying to get other people to feel, think, and do certain things that they would not otherwise have felt, thought, or done. So the meaningful question is not whether people are trying to manipulate your experience and behavior, but whether they’re trying to manipulate you in a way that aligns with or contradicts your own best interests. The mere fact that Facebook, Google, and Amazon run experiments intended to alter your emotional experience in a revenue-increasing way is not necessarily a bad thing if in the process of making more money off you, those companies also improve your quality of life. I’m not taking a stand one way or the other, mind you, but simply pointing out that without controlled experimentation, the user experience on Facebook, Google, Twitter, etc. would probably be very, very different–and most likely less pleasant. So before we lament the perceived loss of all those “positive and beautiful” items in our Facebook news feeds, we should probably remind ourselves that Facebook’s ability to identify and display those items consistently is itself in no small part a product of its continual effort to experimentally test its offering by, yes, experimentally manipulating its users’ feelings and thoughts.

What makes the backlash on this issue particularly strange is that I’m pretty sure most people do actually realize that their experience on Facebook (and on other websites, and on TV, and in restaurants, and in museums, and pretty much everywhere else) is constantly being manipulated. I expect that most of the people who’ve been complaining about the Facebook study on Twitter are perfectly well aware that Facebook constantly alters its user experience–I mean, they even see it happen in a noticeable way once in a while, whenever Facebook introduces a new interface. Given that Facebook has over half a billion users, it’s a foregone conclusion that every tiny change Facebook makes to the news feed or any other part of its websites induces a change in millions of people’s emotions. Yet nobody seems to complain about this much–presumably because, when you put it this way, it seems kind of silly to suggest that a company whose business model is predicated on getting its users to use its product more would do anything other than try to manipulate its users into, you know, using its product more.

Why the backlash is deeply counterproductive

Now, none of this is meant to suggest that there aren’t legitimate concerns one could raise about Facebook’s more general behavior–or about the immense and growing social and political influence that social media companies like Facebook wield. One can certainly question whether it’s really fair to expect users signing up for a service like Facebook’s to read and understand user agreements containing dozens of pages of dense legalese, or whether it would make sense to introduce new regulations on companies like Facebook to ensure that they don’t acquire or exert undue influence on their users’ behavior (though personally I think that would be unenforceable and kind of silly). So I’m certainly not suggesting that we give Facebook, or any other large web company, a free pass to do as it pleases. What I am suggesting, however, is that even if your real concerns are, at bottom, about the broader social and political context Facebook operates in, using this particular study as a lightning rod for criticism of Facebook is an extremely counterproductive, and potentially very damaging, strategy.

Consider: by far the most likely outcome of the backlash Facebook is currently experiencing is that, in future, its leadership will be less likely to allow its data scientists to publish their findings in the scientific literature. Remember, Facebook is not a research institute expressly designed to further understanding of the human condition; it’s a publicly-traded corporation that exists to create wealth for its shareholders. Facebook doesn’t have to share any of its data or findings with the rest of the world if it doesn’t want to; it could comfortably hoard all of its knowledge and use it for its own ends, and no one else would ever be any wiser for it. The fact that Facebook is willing to allow its data science team to spend at least some of its time publishing basic scientific research that draws on Facebook’s unparalleled resources is something to be commended, not criticized.

There is little doubt that the present backlash will do absolutely nothing to deter Facebook from actually conducting controlled experiments on its users, because A/B testing is a central component of pretty much every major web company’s business strategy at this point–and frankly, Facebook would be crazy not to try to empirically determine how to improve user experience. What criticism of the Kramer et al article will almost certainly do is decrease the scientific community’s access to, and interaction with, one of the largest and richest sources of data on human behavior in existence. You can certainly take a dim view of Facebook as a company if you like, and you’re free to critique the way they do business to your heart’s content. But haranguing Facebook and other companies like it for publicly disclosing scientifically interesting results of experiments that it is already constantly conducting anyway–and that are directly responsible for many of the positive aspects of the user experience–is not likely to accomplish anything useful. If anything, it’ll only ensure that, going forward, all of Facebook’s societally relevant experimental research is done in the dark, where nobody outside the company can ever find out–or complain–about it.

[UPDATE July 1st: I’ve posted some additional thoughts in a second post here.]

There is no ceiling effect in Johnson, Cheung, & Donnellan (2014)

This is not a blog post about bullying, negative psychology or replication studies in general. Those are important issues, and a lot of ink has been spilled over them in the past week or two. But this post isn’t about those issues (at least, not directly). This post is about ceiling effects. Specifically, the ceiling effect purportedly present in a paper in Social Psychology, in which Johnson, Cheung, and Donnellan report the results of two experiments that failed to replicate an earlier pair of experiments by Schnall, Benton, and Harvey.

If you’re not up to date on recent events, I recommend reading Vasudevan Mukunth’s post, which provides a nice summary. If you still want to know more after that, you should probably take a gander at the original paper by Schnall, Benton, & Harvey and the replication paper. Still want more? Go read Schnall’s rebuttal. Then read the rejoinder to the rebuttal. Then read Schnall’s first and second blog posts. And maybe a number of other blog posts (here, here, here, and here). Oh, and then, if you still haven’t had enough, you might want to skim the collected email communications between most of the parties in question, which Brian Nosek has been kind enough to curate.

I’m pointing you to all those other sources primarily so that I don’t have to wade very deeply into the overarching issues myself–because (a) they’re complicated, (b) they’re delicate, and (c) I’m still not entirely sure exactly how I feel about them. However, I do have a fairly well-formed opinion about the substantive issue at the center of Schnall’s published rebuttal–namely, the purported ceiling effect that invalidates Johnson et al’s conclusions. So I thought I’d lay that out here in excruciating detail. I’ll warn you right now that if your interests lie somewhere other than the intersection of psychology and statistics (which they probably should), you probably won’t enjoy this post very much. (If your interests do lie at the intersection of psychology and statistics, you’ll probably give this post a solid “meh”.)

Okay, with all the self-handicapping out of the way, let’s get to it. Here’s what I take to be…

Schnall’s argument

The crux of Schnall’s criticism of the Johnson et al replication is a purported ceiling effect. What, you ask, is a ceiling effect? Here’s Schnall’s definition:

A ceiling effect means that responses on a scale are truncated toward the top end of the scale. For example, if the scale had a range from 1-7, but most people selected “7”, this suggests that they might have given a higher response (e.g., “8” or “9”) had the scale allowed them to do so. Importantly, a ceiling effect compromises the ability to detect the hypothesized influence of an experimental manipulation. Simply put: With a ceiling effect it will look like the manipulation has no effect, when in reality it was unable to test for such an effects in the first place. When a ceiling effect is present no conclusions can be drawn regarding possible group differences.

This definition has some subtle-but-important problems we’ll come back to, but it’s reasonable as a first approximation. With this definition in mind, here’s how Schnall describes her core analysis, which she uses to argue that Johnson et al’s results are invalid:

Because a ceiling effect on a dependent variable can wash out potential effects of an independent variable (Hessling, Traxel & Schmidt, 2004), the relationship between the percentage of extreme responses and the effect of the cleanliness manipulation was examined. First, using all 24 item means from original and replication studies, the effect of the manipulation on each item was quantified. … Second, for each dilemma the percentage of extreme responses averaged across neutral and clean conditions was computed. This takes into account the extremity of both conditions, and therefore provides an unbiased indicator of ceiling per dilemma. … Ceiling for each dilemma was then plotted relative to the effect of the cleanliness manipulation (Figure 1).

We can (and will) quibble with these analysis choices, but the net result of the analysis is this:


Here, we see normalized effect size (y-axis) plotted against extremity of item response (x-axis). Schnall’s basic argument is that there’s a strong inverse relationship between the extremity of responses to an item and the size of the experimental effect on that item. In other words, items with extreme responses don’t show an effect, whereas items with non-extreme responses do show an effect. She goes on to note that this pattern is fully accounted for by her own original experiments, and that there is no such relationship in Johnson et al’s data. On the basis of this finding, Schnall concludes that:

Scores are compressed toward the top end of the scale and therefore show limited determinate variance near ceiling. Because a significance test compares variance due to a manipulation to variance due to error, an observed lack of effect can result merely from a lack in variance that would normally be associated with a manipulation. Given the observed ceiling effect, a statistical artefact, the analyses reported by Johnson et al. (2014a) are invalid and allow no conclusions about the reproducibility of the original findings.

Problems with the argument

One can certainly debate what the implications would be even if Schnall’s argument were correct; for instance, it’s debatable whether the presence of a ceiling effect would actually invalidate Johnson et al’s conclusion that they had failed to replicate Schnall et al. An alternative and reasonable interpretation is that Johnson et al would have simply identified important boundary conditions under which the original effect doesn’t work (e.g., that it doesn’t hold in Michigan residents), since they were using Schnall’s original measures. But we don’t have to worry about that in any case, because there are several serious problems with Schnall’s argument. Some of them have to do with the statistical analysis she performs to make her point; some of them have to do with subtle mischaracterizations of what ceiling effects are and where they come from; and some of them have to do with the fact that Schnall’s data actually directly contradict her own argument. Let’s take each of these in turn.

Problems with the analysis

A first problem with Schnall’s analysis is that the normalization procedure she uses to make her point is biased. Schnall computes the normalized effect size for each item as:

(M1 – M2)/(M1 + M2)

Where M1 and M2 are the means for each item in the two experimental conditions (neutral and clean). This transformation is supposed to account for the fact that scores are compressed at the upper end of the scale, near the ceiling.

What Schnall fails to note, however, is that compression should also occur at the bottom of the scale, near the floor. For example, suppose an individual item has means of 1.2 and 1.4. Then Schnall’s normalized effect size estimate would be 0.2/2.6 = 0.08. But if the means had been 4.0 and 4.2–the same absolute difference–then the adjusted estimate would actually be much smaller (around 0.02). So Schnall’s analysis is actually biased in favor of detecting the negative correlation she takes as evidence of a ceiling effect, because she’s not accounting for floor effects simultaneously. A true “clipping” or compression of scores shouldn’t occur at only one extreme of the scale; what should matter is how far from the midpoint a response happens to be. What should happen, if Schnall were to recompute the scores in Figure 1 using a modified criterion (e.g., relative deviation from the scale’s midpoint, rather than absolute score), is that the points at the top left of the figure should pull towards the y-axis to some degree, effectively reducing the slope she takes as evidence of a problem. If there’s any pattern that would suggest a measurement problem, it’s actually an inverted u-shape, where normalized effects are greatest for items with means nearest the midpoint, and smallest for items at both extremes, not just near ceiling. But that’s not what we’re shown.
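To make the arithmetic concrete, here is a minimal sketch of the normalization applied to the same 0.2-point gap at three different places on the scale. The item means are hypothetical numbers chosen for illustration, not values from either paper, and the function name is my own.

```python
def normalized_effect(m1: float, m2: float) -> float:
    """Normalized difference (M1 - M2) / (M1 + M2) between two condition means."""
    return (m1 - m2) / (m1 + m2)


# Hypothetical item means: an identical 0.2-point gap near the floor,
# in the middle of the scale, and near the ceiling.
pairs = {
    "near floor": (1.4, 1.2),
    "mid-scale": (4.2, 4.0),
    "near ceiling": (8.8, 8.6),
}

results = {label: normalized_effect(m1, m2) for label, (m1, m2) in pairs.items()}

for label, value in results.items():
    print(f"{label:>12}: {value:.3f}")
```

The same raw difference yields a steadily smaller normalized value as the means rise, so the transformation by itself pushes low-mean items toward larger "effects" and high-mean items toward smaller ones.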

A second problem is that Schnall’s data actually contradict her own conclusion. She writes:

Across the 24 dilemmas from all 4 experiments, dilemmas with a greater percentage of extreme responses were associated with lower effect sizes (r = -.50, p = .01, two-tailed). This negative correlation was entirely driven by the 12 original items, indicating that the closer responses were to ceiling, the smaller was the effect of the manipulation (r = -.49, p = .10). In contrast, across the 12 replication items there was no correlation (r = .11, p = .74).

But if anything, these results provide evidence of a ceiling effect only in Schnall’s original study, and not in the Johnson et al replications. Recall that Schnall’s argument rests on two claims: (a) effects are harder to detect the more extreme responding on an item gets, and (b) responding is so extreme on the items in the Johnson et al experiments that nothing can be detected. But the results she presents blatantly contradict the second claim. Had there been no variability in item means in the Johnson et al studies, Schnall could have perhaps argued that restriction of range is so extreme that it is impossible to detect any kind of effect. In practice, however, that’s not the case. There is considerable variability along the x-axis, and in particular, one can clearly see that there are two items in Johnson et al that are nowhere near ceiling and yet show no discernible normalized effect of experimental condition at all. Note that these are the very same items that show some of the strongest effects in Schnall’s original study. In other words, the data Schnall presents in support of her argument actually directly contradict her argument. If one is to believe that a ceiling effect is preventing Schnall’s effect from emerging in Johnson et al’s replication studies, then there is no reasonable explanation for the fact that those two leftmost red squares in the figure above are close to the y = 0 line. They should be behaving exactly like they did in Schnall’s study–which is to say, they should be showing very large normalized effects–even if items at the very far right show no effects at all.
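As a side check on the correlations quoted above, here is a rough sketch that converts each reported r into a t statistic using the standard formula t = r * sqrt(n - 2) / sqrt(1 - r^2) and compares it against the two-tailed .05 critical value from a t table. The r and n values come from Schnall's quoted passage; the function and variable names are mine, and the critical values are the usual tabled ones for 22 and 10 degrees of freedom.

```python
import math


def t_from_r(r: float, n: int) -> float:
    """t statistic for testing a Pearson correlation against zero (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Two-tailed critical values of t at alpha = .05 from a standard t table,
# keyed by degrees of freedom.
T_CRIT = {22: 2.074, 10: 2.228}

# (r, number of items) as reported in the quoted passage.
reported = {
    "all 24 items": (-0.50, 24),
    "12 original items": (-0.49, 12),
    "12 replication items": (0.11, 12),
}

summary = {}
for label, (r, n) in reported.items():
    t = t_from_r(r, n)
    df = n - 2
    significant = abs(t) > T_CRIT[df]
    summary[label] = (t, significant)
    print(f"{label:>21}: t({df}) = {t:6.2f}, exceeds .05 cutoff: {significant}")
```

Only the pooled 24-item correlation clears the conventional threshold; the correlation within the 12 original items does not reach it on its own, which is consistent with the reported p = .10.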

Third, Schnall’s argument that a ceiling effect completely invalidates Johnson et al’s conclusions is a gross exaggeration. Ceiling effects are not all-or-none; the degree of score compression into the upper end of a measure will vary continuously (unless there is literally no variance at all in the responses, which is clearly not the case here). Even if we took at face value Schnall’s finding that there’s an inverse relationship between effect size and extremity in her original data (r = -0.5), all this would tell us is that there’s some compression of scores. Schnall’s suggestion that “given the observed ceiling effect, a statistical artifact, the analyses reported in Johnson et al (2014a) are invalid and allow no conclusions about the reproducibility of the original findings” is simply false. Even in the very best case scenario (which this obviously isn’t), the very strongest claim Schnall could comfortably make is that there may be some compression of scores, with unknown impact on the detectable effect size. It is simply not credible for Schnall to suggest that the mere presence of something that looks vaguely like a ceiling effect is sufficient to completely rule out detection of group differences in the Johnson et al experiments. And we know this with 100% certainty, because…

There are robust group differences in the replication experiments

Perhaps the clearest refutation of Schnall’s argument for a ceiling effect is that, as Johnson et al noted in their rejoinder, the Johnson et al experiments did in fact successfully identify some very clear group differences (and, ironically, ones that were also present in Schnall’s original experiments). Specifically, Johnson et al showed a robust effect of gender on vignette ratings. Here’s what the results look like:

We can see clearly that, in both replication experiments, there’s a large effect of gender but no discernible effect of experimental condition. This pattern directly refutes Schnall’s argument. She cannot have it both ways: if a ceiling effect precludes the presence of group differences, then there cannot be a ceiling effect in the replication studies, or else the gender effect could not have emerged repeatedly. Conversely, if ceiling effects don’t preclude detection of effects, then there is no principled reason why Johnson et al would fail to detect Schnall’s original effect.

Interestingly, it’s not just the overall means that tell the story quite clearly. Here’s what happens if we plot the gender effects in Johnson et al’s experiments in the same way as Schnall’s Figure 1 above:


Notice that we see here the same negative relationship between effect size and extremity that Schnall observed in her own data, and whose absence in Johnson et al’s data she (erroneously) took as evidence of a ceiling effect.
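If it helps to see why a ceiling attenuates group differences gradually rather than wiping them out, here is a minimal sketch under deliberately simple assumptions: latent ratings are normally distributed, and any rating above the scale maximum is recorded as the maximum. The scale maximum, spread, and group gap below are all hypothetical, and real ratings are discrete rather than continuous, so treat this as an illustration of the logic and not a model of either dataset.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def censored_mean(mu: float, sigma: float, ceiling: float) -> float:
    """Expected value of min(X, ceiling) for X ~ Normal(mu, sigma)."""
    z = (ceiling - mu) / sigma
    below = STD_NORMAL.cdf(z)  # probability that a rating is not clipped
    return mu * below - sigma * STD_NORMAL.pdf(z) + ceiling * (1 - below)


CEILING = 9.0   # hypothetical top of the rating scale
SIGMA = 1.5     # hypothetical spread of latent ratings
TRUE_GAP = 0.5  # hypothetical latent difference between two groups

observed_gaps = []
for low_group_mean in (5.0, 7.0, 8.0, 9.0):
    high = censored_mean(low_group_mean + TRUE_GAP, SIGMA, CEILING)
    low = censored_mean(low_group_mean, SIGMA, CEILING)
    observed_gaps.append(high - low)
    print(f"latent mean {low_group_mean:.1f}: observed gap = {high - low:.3f}")
```

Under these assumptions the observed gap shrinks smoothly as the latent means approach the top of the scale, but it stays above zero even when the lower group's latent mean sits exactly at the ceiling.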

There’s a ceiling effect in Schnall’s own data

Yet another flaw in Schnall’s argument is that taking the ceiling effect charge seriously would actually invalidate at least one of her own experiments. Consider that the only vignette in Schnall et al’s original Experiment 1 that showed a statistically significant effect also had the highest rate of extreme responding in that study (mean rating of 8.25 / 9). Even more strikingly, the proportion of participants who gave the most extreme response possible on that vignette (70%) was higher than for any of the vignettes in either of Johnson et al’s experiments. In other words, Schnall’s core argument is that her effect could not possibly be replicated in Johnson et al’s experiments because of the presence of a ceiling effect, yet the only vignette to show a significant effect in Schnall’s original Experiment 1 had an even more pronounced ceiling effect. Once again, she cannot have it both ways. Either ceiling effects don’t preclude detection of effects, or, by Schnall’s own logic, the original Study 1 effect was probably a false positive.

When pressed on this point by Daniel Lakens in the email thread, Schnall gave the following response:

Note for the original studies we reported that the effect was seen on aggregate data, not necessarily for individual dilemmas. Such results will always show statistical fluctuations at the item level, hence it is important to not focus on any individual dilemma but on the overall pattern.

I confess that I’m not entirely clear on what Schnall means here. One way to read this is that she is conceding that the significant effect in the vignette in question (the “kitten” dilemma) was simply due to random fluctuations. Note that since the effect in Schnall’s Experiment 1 was only barely significant when averaging across all vignettes (in fact, it wasn’t quite significant even so), eliminating this vignette from consideration would actually have produced a null result. But suppose we overlook that and instead agree with Schnall that strange things can happen to individual items, and that what we should focus on is the aggregate moral judgment, averaged across vignettes. That would be perfectly reasonable, except that it’s directly at odds with Schnall’s more general argument. To see this, we need only look at the aggregate distribution of scores in Johnson et al’s Experiments 1 and 2:


There’s clearly no ceiling effect here; the mode in both experiments is nowhere near the maximum. So once again, Schnall can’t have it both ways. If her argument is that what matters is the aggregate measure (which seems right to me, since many reputable measures have multiple individual items with skewed distributions, and this can even be a desirable property in certain cases), then there’s nothing objectionable about the scores in the Johnson et al experiments. Conversely, if Schnall’s argument is that it’s fair to pick on individual items, then there is effectively no reason to believe Schnall’s own original Experiment 1 (and for all I know, her experiment 2 as well–I haven’t looked).

What should we conclude?

What can we conclude from all this? A couple of things. First, Schnall has no basis for arguing that there was a fundamental statistical flaw that completely invalidates Johnson et al’s conclusions. From where I’m sitting, there doesn’t seem to be any meaningful ceiling effect in Johnson et al’s data, and that’s attested to by the fact that Johnson et al had no trouble detecting gender differences in both experiments (successfully replicating Schnall’s earlier findings). Moreover, the arguments Schnall makes in support of the postulated ceiling effects suffer from serious flaws. At best, what Schnall could reasonably argue is that there might be some restriction of range in the ratings, which would artificially reduce the effect size. However, given that Johnson et al’s sample sizes were 3 – 5 times larger than Schnall’s, it is highly implausible to suppose that effects as big as Schnall’s completely disappeared–especially given that robust gender effects were detected. Moreover, given that the skew in Johnson et al’s aggregate distributions is not very extreme at all, and that many individual items on many questionnaire measures show ceiling or floor effects (e.g., go look at individual Big Five item distributions some time), taking Schnall’s claims seriously would in effect invalidate not just Johnson et al’s results, but also a huge proportion of the more general psychology literature.

Second, while Schnall has raised a number of legitimate and serious concerns about the tone of the debate and comments surrounding Johnson et al’s replication, she’s also made a number of serious charges of her own that depend on the validity of her argument about ceiling effects, and not on the civility (or lack thereof) of commentators on various sides of the debate. Schnall has (incorrectly) argued that Johnson et al have committed a basic statistical error that most peer reviewers would have caught–effectively accusing them of incompetence. She has argued that Johnson et al’s claim of replication failure is unwarranted, and constitutes defamation of her scientific reputation. And she has suggested that the editors of the special issue (Daniel Lakens and Brian Nosek) behaved unethically by first not seeking independent peer review of the replication paper, and then actively trying to suppress her own penetrating criticisms. In my view, none of these accusations are warranted, because they depend largely on Schnall’s presumption of a critical flaw in Johnson et al’s work that is in fact nonexistent. I understand that Schnall has been under a lot of stress recently, and I sympathize with her concerns over unfair comments made by various people (most of whom have now issued formal apologies). But given the acrimonious tone of the more general ongoing debate over replication, it’s essential that we distinguish the legitimate issues from the illegitimate ones so that we can focus exclusively on the former, and don’t end up needlessly generating more hostility on both sides.

Lastly, there is the question of what conclusions we should draw from the Johnson et al replication studies. Personally, I see no reason to question Johnson et al’s conclusions, which are actually very modest:

In short, the current results suggest that the underlying effect size estimates from these replication experiments are substantially smaller than the estimates generated from the original SBH studies. One possibility is that there are unknown moderators that account for these apparent discrepancies. Perhaps the most salient difference between the current studies and the original SBH studies is the student population. Our participants were undergraduates in the United States whereas participants in SBH’s studies were undergraduates in the United Kingdom. It is possible that cultural differences in moral judgments or in the meaning and importance of cleanliness may explain any differences.

Note that Johnson et al did not assert or intimate in any way that Schnall et al’s effects were “not real”. They did not suggest that Schnall et al had committed any errors in their original study. They explicitly acknowledged that unknown moderators might explain the difference in results (though they also noted that this was unlikely considering the magnitude of the differences). Effectively, Johnson et al stuck very close to their data and refrained from any kind of unfounded speculation.

In sum, unless Schnall has other concerns about Johnson’s data besides the purported ceiling effect (and she hasn’t raised any that I’ve seen), I think Johnson et al’s paper should enter the record exactly as its authors intended. Johnson, Cheung, & Donnellan (2014) is, quite simply, a direct preregistered replication of Schnall, Benton, & Harvey (2008) that failed to detect the effects reported in the original study, and there should be nothing at all controversial about this. There are certainly worthwhile discussions to be had about why the replication failed, and what that means for the original effect, but this doesn’t change the fundamental fact that the replication did fail, and we shouldn’t pretend otherwise.

Big Data, n. A kind of black magic

The annual Association for Psychological Science meeting is coming up in San Francisco this week. One of the cross-cutting themes this year is “Big Data: Understanding Patterns of Human Behavior”. Since I’m giving two Big Data-related talks (1, 2), and serving as discussant on a related symposium, I’ve been spending some time recently trying to come up with a sensible definition of Big Data within the context of psychological science. This has, in turn, led me to ponder the meaning of Big Data more generally.

After a few sleepless nights mulling it over, I’ve concluded that producing a unitary, comprehensive, domain-general definition of Big Data is probably not possible, for the simple reason that different communities have adopted and co-opted the term for decidedly different purposes. For example, in said field of psychology, the very largest datasets that most researchers currently work with contain, at most, tens of thousands of cases and a few hundred variables (there are exceptions, of course). Such datasets fit comfortably into memory on any modern laptop; you’d have a hard time finding (m)any data scientists willing to call a dataset of this scale “Big”. Yet here we are, heading into APS, with multiple sessions focusing on the role of Big Data in psychological science. And psychology’s not unusual in this respect; we’re seeing similar calls for Big Data this and Big Data that in pretty much all branches of science and every area of the business world. I mean, even the humanities are getting in on the action.

You could take a cynical view of this and argue that all this really goes to show is that people like buzzwords. And there’s probably some truth to that. More pragmatically, though, we should acknowledge that language is this flexible kind of thing that likes to reshape itself from time to time. Words don’t have any intrinsic meaning above and beyond what we do with them, and it’s certainly not like anyone has a monopoly on a term that only really exploded into the lexicon circa 2011. So instead of trying to come up with a single, all-inclusive definition of Big Data, I’ve instead opted to try and make sense of the different usages we’re seeing in different communities. Below I suggest three distinct, but overlapping, definitions–corresponding to three different ways of thinking about what makes data “Big”. They are, roughly, (1) the kind of infrastructure required to support data processing, (2) the size of the dataset relative to the norm in a field, and (3) the complexity of the models required to make sense out of the data. To a first approximation, one can think of these as engineering, scientific, and statistical perspectives on Big Data, respectively.

The engineering perspective

One way to define Big Data is in terms of the infrastructure required to analyze the data. This is the closest thing we have to a classical definition. In fact, this way of thinking about what makes data “big” arguably predates the term Big Data itself. Take this figure, courtesy of Google Trends:

Notice that searches for Hadoop (a framework for massively distributed data-intensive computing) actually precede the widespread use of the term “Big Data” by a couple of years. If you’re the kind of person who likes to base their arguments entirely on search-based line graphs from Google (and I am!), you have here a rather powerful Exhibit A.

Alternatively, if you’re a more serious kind of person who privileges reason over pretty line plots, consider the following, rather simple, argument for Big Data qua infrastructure problem: any dataset that keeps growing is eventually going to get too big–meaning, it will inevitably reach a point at which it no longer fits into memory, or even onto local storage–and requires a fundamentally different, massively parallel architecture to process. If you can solve your alleged “big data” problems by installing a new hard drive or some more RAM, you don’t really have a Big Data problem, you have an I’m-too-lazy-to-deal-with-this-right-now problem.

A real Big Data problem, from an engineering standpoint, is what happens once you’ve installed all the RAM your system can handle, maxed out your RAID array, and heavily optimized your analysis code, yet still find yourself unable to process your data in any reasonable amount of time. If you then complain to your IT staff about your computing problems and they start ranting to you about Hadoop and Hive and how you need to hire a bunch of engineers so you can build out a cluster and do Big Data the way Big Data is supposed to be done, well, congratulations–you now have a Big Data problem in the engineering sense. You now need to figure out how to build a highly distributed computing platform capable of handling really, really, large datasets.

Once the hungry wolves of Big Data have been temporarily pacified by building a new data center (or, you know, paying for an AWS account), you may have to rewrite at least part of your analysis code to take advantage of the massive parallelization your new architecture affords. But conceptually, you can probably keep asking and answering the same kinds of questions with your data. In this sense, Big Data isn’t directly about the data itself, but about what the data makes you do: a dataset counts as “Big” whenever it causes you to start whispering sweet nothings in Hadoop’s ear at night. Exactly when that happens will depend on your existing infrastructure, the demands imposed by your data, and so on. On modern hardware, some people have suggested that the transition tends to happen fairly consistently when datasets get to around 5 – 10 TB in size. But of course, that’s just a loose generalization, and we all know that loose generalizations are always a terrible idea.
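For a back-of-the-envelope sense of the scales involved, here is a small sketch that estimates the in-memory footprint of a dense numeric table. The row and column counts stand in for a large psychology dataset of the kind described earlier, and the 16 GB laptop and 8 bytes per value are assumptions of mine, not measurements.

```python
GB = 10**9
TB = 10**12


def dense_size_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Approximate size of a dense table stored as 8-byte numbers."""
    return n_rows * n_cols * bytes_per_value


# Hypothetical "large" psychology dataset: 50,000 cases by 300 variables.
psych_bytes = dense_size_bytes(50_000, 300)

# Hypothetical laptop with 16 GB of RAM.
laptop_ram = 16 * GB
fits_in_memory = psych_bytes <= laptop_ram

# Rows needed, at 300 variables each, to reach the low end of the
# 5 - 10 TB range mentioned above.
rows_for_5tb = (5 * TB) // (300 * 8)

print(f"psychology-scale dataset: {psych_bytes / GB:.2f} GB")
print(f"fits in laptop memory: {fits_in_memory}")
print(f"rows needed to reach 5 TB: {rows_for_5tb:,}")
```

On these assumptions the psychology-scale table occupies about a tenth of a gigabyte, and you would need roughly two billion rows of the same width before the engineering definition starts to apply.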

The scientific perspective

Defining Big Data in terms of architecture and infrastructure is all well and good in domains where normal operations regularly generate terabytes (or even–gasp–petabytes!) of data. But the reality is that most people–and even, I would argue, many people whose job title currently includes the word “data” in it–will rarely need to run analyses distributed across hundreds or thousands of nodes. If we stick with the engineering definition of Big Data, this means someone like me–a lowly social or biomedical scientist who frequently deals with “large” datasets, but almost never with gigantic ones–doesn’t get to say they do Big Data. And that seems kind of unfair. I mean, Big Data is totally in right now, so why should corporate data science teams and particle physicists get to have all the fun? If I want to say I work with Big Data, I should be able to say I work with Big Data! There’s no way I can go to APS and give talks about Big Data unless I can unashamedly look myself in the mirror and say, look at that handsome, confident man getting ready to go to APS and talk about Big Data. So it’s imperative that we find a definition of Big Data that’s compatible with the kind of work people like me do.

Hey, here’s one that works:

Big Data, n. The minimum amount of data required to make one’s peers uncomfortable with the size of one’s data.

This definition is mostly facetious–but it’s a special kind of facetiousness that’s delicately overlaid on top of an earnest, well-intentioned core. The earnest core is that, in practice, many people who think of themselves as Big Data types but don’t own a timeshare condo in Hadoop Land implicitly seem to define Big Data as any dataset large enough to enable new kinds of analyses that weren’t previously possible with smaller datasets. Exactly what dimensionality of data is sufficient to attain this magical status will vary by field, because conventional dataset sizes vary by field. For instance, in human vision research, many researchers can get away with collecting a few hundred trials from three subjects in one afternoon and calling it a study. In contrast, if you’re a population geneticist working with raw sequence data, you probably deal with fuhgeddaboudit amounts of data on a regular basis. So clearly, what it means to be in possession of a “big” dataset depends on who you are. But the point is that in every field there are going to be people who look around and say, you know what? Mine’s bigger than everyone else’s. And those are the people who have Big Data.

I don’t mean that pejoratively, mind you. Quite the contrary: an arms race towards ever-larger datasets strikes me as a good thing for most scientific fields to have, regardless of whether or not the motives for the data embigenning are perfectly cromulent. Having more data often lets you do things that you simply couldn’t do with smaller datasets. With more data, confidence intervals shrink, so effect size estimates become more accurate; it becomes easier to detect and characterize higher-order interactions between variables; you can stratify and segment the data in various ways, explore relationships with variables that may not have been of a priori interest; and so on and so forth. Scientists, by and large, seem to be prone to thinking of Big Data in these relativistic terms, so that a “Big” dataset is, roughly, a dataset that’s large enough and rich enough that you can do all kinds of novel and interesting things with it that you might not have necessarily anticipated up front. And that’s refreshing, because if you’ve spent much time hanging around science departments, you’ll know that the answer to about 20% of all questions during Q&A periods ends with the words well, that’s a great idea, but we just don’t have enough data to answer that. Big Data, in a scientific sense, is when that answer changes to: hey, that’s a great idea, and I’ll try that as soon as I get back to my office. (Or perhaps more realistically: hey, that’s a great idea, and I’ll be sure to try that–as soon as I can get my one tech-savvy grad student to wrangle the data into the right format.)
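To put a rough number on the "confidence intervals shrink" point, here is a minimal sketch. It assumes the textbook normal approximation for the mean of independent observations, and the function name and the value z = 1.96 are my own illustrative choices, not anything specific to a particular dataset.

```python
import math


def ci_half_width(sd, n, z=1.96):
    """Approximate half-width of a 95% confidence interval for a mean.

    Assumes independent observations and a normal approximation;
    sd is the standard deviation of the outcome, n the sample size.
    """
    return z * sd / math.sqrt(n)


# The interval narrows with the square root of n, so going from
# 20 to 2,000 cases (100x the data) makes it 10x narrower.
for n in (20, 200, 2000):
    print(n, round(ci_half_width(sd=1.0, n=n), 3))
```

The square-root scaling is the sobering part: halving the width of an interval takes four times as much data, which is one reason the arms race never really ends.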

It’s probably worth noting in passing that this relativistic, application-centered definition of Big Data also seems to be picking up cultural steam far beyond the scientific community. Most of the recent criticisms of Big Data seem to have something vaguely like this definition in mind. (Actually, I would argue pretty strenuously that most of these criticisms aren’t really even about Big Data in this sense, and are actually just objections to mindless and uncritical exploratory analysis of any dataset, however big or small. But that’s a post for another day.)

The statistical perspective

A third way to think about Big Data is to focus on the kinds of statistical methods required in order to make sense of a dataset. On this view, what matters isn’t the size of the dataset, or the infrastructure demands it imposes, but how you use it. Once again, we can appeal to a largely facetious definition clinging for dear life onto a half-hearted effort at pithy insight:

Big Data, n: the minimal amount of data that allows you to set aside a quarter of your dataset as a hold-out and still train a model that performs reasonably well when tested out-of-sample.

The nugget of would-be insight in this case is this: the world is usually a more complicated place than it appears to be at first glance. It’s generally much harder to make reliable predictions about new (i.e., previously unseen) cases than one might suppose given conventional analysis practices in many fields of science. For example, in psychology, it’s very common to see papers report extremely large R² values from fitted models–often accompanied by claims to the effect that the researchers were able to “predict” most of the variance in the outcome. But such claims are rarely actually supported by the data presented, because the studies in question overwhelmingly tend to overfit their models by using the same data for training and testing (to say nothing of p-hacking and other Questionable Research Practices). Fitting a model that can capably generalize to entirely new data often requires considerably more data than one might expect. The precise amount depends on the problem in question, but I think it’s fair to say that there are many domains in which problems that researchers routinely try to tackle with sample sizes of 20 – 100 cases would in reality require samples two or three orders of magnitude larger to really get a good grip on.
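The gap between in-sample and out-of-sample performance is easy to reproduce with pure noise. The sketch below is a toy simulation, not an analysis of any real study: every number in it (40 cases, 100 candidate predictors, a one-quarter hold-out) is an assumption chosen for illustration, and squared correlation stands in for a proper model R². The "model" is simply the single noise predictor that correlates best with the outcome in the training cases.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def one_study(rng, n_cases=40, n_features=100, holdout_fraction=0.25):
    """Simulate one 'study' in which nothing predicts the outcome."""
    outcome = [rng.gauss(0, 1) for _ in range(n_cases)]
    features = [[rng.gauss(0, 1) for _ in range(n_cases)]
                for _ in range(n_features)]
    n_test = int(n_cases * holdout_fraction)
    train = slice(0, n_cases - n_test)
    test = slice(n_cases - n_test, n_cases)
    # "Fit" by picking the predictor that looks best in the training cases.
    best = max(features,
               key=lambda f: pearson_r(f[train], outcome[train]) ** 2)
    r2_in = pearson_r(best[train], outcome[train]) ** 2   # same data reused
    r2_out = pearson_r(best[test], outcome[test]) ** 2    # held-out cases
    return r2_in, r2_out


def average_r2(n_studies=100, seed=0):
    """Average in-sample and hold-out squared correlation over many studies."""
    rng = random.Random(seed)
    results = [one_study(rng) for _ in range(n_studies)]
    mean_in = sum(r[0] for r in results) / n_studies
    mean_out = sum(r[1] for r in results) / n_studies
    return mean_in, mean_out


if __name__ == "__main__":
    print(average_r2())
```

Because every predictor is noise, anything the in-sample figure shows beyond the hold-out figure is selection artifact. Note that the hold-out squared correlation is itself inflated when the hold-out set is tiny, which is one more way small samples flatter a model.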

The key point is that when we don’t have a lot of data to work with, it’s difficult to say much of anything about how big an effect is (unless we’re willing to adopt strong priors). Instead, we tend to fall back on the crutch of null hypothesis significance testing and start babbling on about whether there is or isn’t a “statistically significant effect”. I don’t really want to get into the question of whether the latter kind of thinking is ever useful (see Krantz (1999) for a review of its long and sordid history). What I do hope is not controversial is this: if your conclusions are ever in danger of changing radically depending on whether the coefficients in your model are on this side of p = .05 versus that side of p = .05, those conclusions are, by definition, not going to be terribly reliable over the long haul. Anything that helps move us away from that decision boundary and puts us in a position where we can worry more about what our conclusions ought to be than about whether we should be saying anything at all is a good thing. And since the single thing that matters most in that regard is the size of our dataset, it follows that we should want to have datasets that are as Big as possible. If we can fit complex models using lots of features and show that those models still perform well when tested out-of-sample, we can feel much more confident about whatever else we feel inclined to say.

From a statistical perspective, then, one might say that a dataset is “Big” when it’s sufficiently large that we can spend most of our time thinking about what kinds of models to fit and what kinds of features to include so as to maximize predictive power and/or understanding, rather than worrying about what we can and can’t do with the data for fear of everything immediately collapsing into a giant multicollinear mess. Admittedly, this is more of a theoretical ideal than a practical goal, because as Andrew Gelman points out, in practice “N is never large”. As soon as we get our hands on enough data to stabilize the estimates from one kind of model, we immediately go on to ask more fine-grained questions that require even more data. And we don’t stop until we’re right back where we started, hovering at the very edge of our ability to produce sensible estimates, staring down the precipice of uncertainty. But hey, that’s okay. Nobody said these definitions have to be useful; it’s hard enough just trying to make them semi-coherent.


So there you have it: three ways to define Big Data. All three of these definitions are fuzzy, and will bleed into one another if you push on them a little bit. In particular, you could argue that, extensionally, the engineering definition of Big Data is a superset of the other two definitions, as it’s very likely that any dataset big enough to require a fundamentally different architecture is also big enough to handle complex statistical models and to do interesting and novel things with. So the point of all this is not to describe three completely separate communities with totally different practices; it’s simply to distinguish between three different uses of the term Big Data, all of which I think are perfectly sensible in different contexts, but that can cause communication problems when people from different backgrounds interact.

Of course, this isn’t meant to be an exhaustive catalog. I don’t doubt that there are many other potential definitions of Big Data that would each elicit enthusiastic head nods from various communities. For example, within the less technical sectors of the corporate world, there appears to be yet another fairly distinctive definition of Big Data. It goes something like this:

Big Data, n. A kind of black magic practiced by sorcerers known as quants. Nobody knows how it works, but it’s capable of doing anything.

In any case, the bottom line here is really just that context matters. If you go to APS this week, there’s a good chance you’ll stumble across many psychologists earnestly throwing the term “Big Data” around, even though they’re mostly discussing datasets that would fit snugly into a sliver of memory on a modern phone. If your day job involves crunching data at CERN or Google, this might amuse you. But the correct response, once you’re done smiling on the inside, is not, Hah! That’s not Big Data, you idiot! It should probably be something more like Hey, you talk kind of funny. You must come from a different part of the world than I do. We should get together some time and compare notes.

estimating the influence of a tweet–now with 33% more causal inference!

Twitter is kind of a big deal. Not just out there in the world at large, but also in the research community, which loves the kind of structured metadata you can retrieve for every tweet. A lot of researchers rely heavily on Twitter to model social networks, information propagation, persuasion, and all kinds of interesting things. For example, here’s the abstract of a nice recent paper on arXiv that aims to predict successful memes using network and community structure:

We investigate the predictability of successful memes using their early spreading patterns in the underlying social networks. We propose and analyze a comprehensive set of features and develop an accurate model to predict future popularity of a meme given its early spreading patterns. Our paper provides the first comprehensive comparison of existing predictive frameworks. We categorize our features into three groups: influence of early adopters, community concentration, and characteristics of adoption time series. We find that features based on community structure are the most powerful predictors of future success. We also find that early popularity of a meme is not a good predictor of its future popularity, contrary to common belief. Our methods outperform other approaches, particularly in the task of detecting very popular or unpopular memes.

One limitation of much of this body of research is that the data are almost invariably observational. We can build sophisticated models that do a good job predicting some future outcome (like meme success), but we don’t necessarily know that the “important” features we identify carry any causal influence. In principle, they could be completely epiphenomenal–for example, in the study I linked to, maybe the community structure features are just a proxy for some other, causally important, factor (e.g., whether the content of a meme has sufficiently broad appeal to attract attention from many different kinds of people). From a predictive standpoint, this may not matter much; if your goal is just to passively predict whether a meme is going to be successful or not, it’s irrelevant whether or not the features you’re using are doing causal work. On the other hand, if you want to actively design memes in such a way as to maximize their spread, the ability to get a handle on causation starts to look pretty important.

How can we estimate the direct causal influence of a tweet on the downstream popularity of a meme? Here’s a simple and (I suspect) very feasible idea in two steps:

  1. Create a small web app that allows any existing Twitter user to register via Twitter authentication. On signing up, a user has to specify just one (optional) setting: the proportion of their intended retweets they’re willing to withhold. Let’s call this the Withholding Fraction (WF).
  2. Every time (or at least some of the time) a registered user wants to retweet a particular tweet*, they do so via the new web app’s interface (which has permission to post to the user’s Twitter account) instead of whatever interface they’re currently using. The key is that the retweet isn’t just obediently passed along; instead, the target tweet is retweeted successfully with probability (1 – WF), and randomly suppressed from the user’s stream with probability (WF).
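The randomization at the heart of step 2 is only a few lines. Here is a minimal sketch of that one piece, leaving out authentication and the actual posting entirely; the names `RetweetDecision` and `decide_retweet` are hypothetical, and nothing here touches the real Twitter API.

```python
import random
from dataclasses import dataclass


@dataclass
class RetweetDecision:
    """One logged outcome of the coin flip (hypothetical record format)."""
    tweet_id: str
    withholding_fraction: float
    posted: bool


def decide_retweet(tweet_id, withholding_fraction, rng=random):
    """Post with probability (1 - WF); withhold with probability WF."""
    if not 0.0 <= withholding_fraction <= 1.0:
        raise ValueError("withholding_fraction must be between 0 and 1")
    # rng.random() is uniform on [0, 1), so this comparison is True
    # with probability exactly (1 - withholding_fraction).
    posted = rng.random() >= withholding_fraction
    return RetweetDecision(tweet_id, withholding_fraction, posted)


# Example: a user willing to withhold a quarter of their retweets.
decision = decide_retweet("example-tweet-id", 0.25)
print(decision)
```

The important design point is that every decision gets logged, withheld or not, together with the WF in force at the time; the withheld retweets are the control group, so they are the half of the data that matters most.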

Doing this would allow the community to very quickly (assuming rapid adoption, which seems reasonably likely) build up an enormous database of tweets that were targeted for retweeting by an active user, but randomly assigned to fail with some known probability. Researchers would then be able to directly quantify the causal impact of individual retweets on downstream popularity–and to estimate that influence conditional on all of the other standard variables, like the retweeter’s number of followers, the content of the tweet, etc. Of course, this still wouldn’t get us to true experimental manipulation of such features (i.e., we wouldn’t be manipulating users’ follower networks, just randomly omitting tweets from users with different followers), but it seems like a step in the right direction**.
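Once such a database existed, the simplest possible analysis would be a difference in means between posted and withheld retweets. The sketch below assumes each record carries a flag for whether the retweet went out and some downstream popularity count for the original tweet (say, later retweets); both the record format and the function name are made up for illustration. It also assumes everyone uses the same WF. If WFs differ across users, you would want to compare within WF levels or weight by the known assignment probabilities instead.

```python
from statistics import mean


def estimate_retweet_effect(records):
    """Difference in mean downstream popularity: posted minus withheld.

    records: iterable of (posted, downstream_count) pairs, where posted
    is True if the retweet actually went out. Hypothetical format.
    """
    records = list(records)
    posted = [count for was_posted, count in records if was_posted]
    withheld = [count for was_posted, count in records if not was_posted]
    if not posted or not withheld:
        raise ValueError("need at least one posted and one withheld retweet")
    return mean(posted) - mean(withheld)


# Toy, made-up numbers purely to show the call:
toy_records = [(True, 10), (True, 12), (False, 7), (False, 9)]
print(estimate_retweet_effect(toy_records))
```

Because assignment is random, this difference can be read causally in a way that the same comparison on observational data could not.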

I figure building a barebones app like this would take an experienced developer familiar with the Twitter OAuth API just a day or two. And I suspect many people (myself included!) would be happy to contribute to this kind of experiment, provided that all of the resulting data were made public. (I’m aware that there are all kinds of restrictions on sharing assembled Twitter datasets, but we’re not talking about sharing firehose dumps here, just a restricted set of retweets from users who’ve explicitly given their consent to have the data used in this way.)

Has this kind of thing already been done? If not, does anyone want to build it?


* It doesn’t just have to be retweets, of course; the same principle would work just as well for withholding a random fraction of original tweets. But I suspect not many users would be willing to randomly eliminate a proportion of their original content from the firehose.

** If we really wanted to get close to true random assignment, we could potentially inject selected tweets into random users’ streams based on selected criteria. But I’m not sure how many tweeps would consent to have entirely random retweets published in their name (I probably wouldn’t), so this probably isn’t viable.
