These two tables contain identical data (take a look at the numbers) and show the efficacy of two different experimental drugs. In each table, the data are split into categories according to some third characteristic: gender or blood pressure. We are left wondering: does the drug help or harm patients? In Table 1.1, the all-population recovery rate is worse with the drug but better in each subset. In Table 1.2, the all-population recovery rate is better with the drug but worse in each subset (note: these data are an example of Simpson's Paradox). So does the drug help or not? There is no statistical test to answer this. The scientist must bring their own firmly held answer to one question: can the drug influence who is in each category?

  • For the table categorized by gender, we firmly answer no: the drug cannot change the gender of the patient, therefore we can rely on the categorized data.

  • For the table categorized by blood pressure, we strongly suspect yes, the drug can alter the patient's blood pressure, therefore we cannot rely on categorized data.

With this strong conviction established, we now determine that:

  • Table 1.1 supports the use of the drug: we read the categorized data for this conclusion.

  • Table 1.2 supports the use of the drug: we read the combined data row for the answer.

It is a mistake to categorize data using a quality that can be influenced by the drug under investigation.
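
The reversal itself is easy to reproduce. The sketch below uses illustrative counts (an assumption for demonstration; the tables are not reproduced in this text, so substitute their actual values) arranged the way the gender table is described: the drug looks better within each subset and worse in the combined row.

```python
# Illustrative (recovered, total) counts per arm and gender.
# ASSUMPTION: these numbers are for demonstration only; replace them
# with the values from the actual table.
counts = {
    "drug":    {"men": (81, 87),   "women": (192, 263)},
    "no drug": {"men": (234, 270), "women": (55, 80)},
}

def rate(recovered, total):
    """Recovery rate for one cell of the table."""
    return recovered / total

def combined_rate(arm):
    """Recovery rate for one arm with the subsets pooled together."""
    recovered = sum(r for r, _ in counts[arm].values())
    total = sum(t for _, t in counts[arm].values())
    return recovered / total

for group in ("men", "women"):
    drug = rate(*counts["drug"][group])
    control = rate(*counts["no drug"][group])
    print(f"{group:>8}: drug {drug:.1%} vs no drug {control:.1%}")

print(f"combined: drug {combined_rate('drug'):.1%} "
      f"vs no drug {combined_rate('no drug'):.1%}")
```

The pooled row flips because the two arms contain very different mixes of men and women, not because the drug behaves differently in the pooled population.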

Source: Jewell, N. P., Pearl, J., Glymour, M. (2016). Causal Inference in Statistics: A Primer. Germany: Wiley.


We insist that our students understand this distinction: the p-value is the probability of getting a result like this, assuming the null hypothesis is true; it is not the same thing as the probability our hypothesis is true, given that we’ve seen a result like this. When students in the medical school look at me as if this is a meaningless distinction, I say: did those two statements sound similar? They are as different as telling you that half of all Welsh people are women, or telling you that half of all women are Welsh people. Students are usually willing to accept that these two sentences are very different, even if they sound closely related. But I wonder if they understand why we statisticians place such an emphasis on the difference?

The recent buzz in scientific journals about an alleged replication crisis shows how widespread this misunderstanding is. The medical publishing world seems to be very surprised that studies that achieve “statistical significance” (defining significance with a 5% threshold and using 95% confidence intervals) can’t be replicated much of the time. Did we think that because we use a 5% threshold for statistical significance, and 95% confidence intervals, 95% of studies with positive findings should be successfully replicated? Or, to put it another way – did we think that because half of the Welsh people are women, it follows that half of the women will turn out to be Welsh? (source)
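
A rough way to see the gap between the two conditional probabilities is to apply Bayes' rule to a batch of studies. The sketch below is only illustrative: the function name, the prior fraction of tested hypotheses that are actually true, and the 80% power figure are all assumptions introduced here, not values from any particular literature.

```python
def positive_predictive_value(prior, power, alpha):
    """P(effect is real | significant result), by Bayes' rule.

    prior: assumed fraction of tested hypotheses that are true
    power: assumed P(significant | effect is real)
    alpha: significance threshold, P(significant | no effect)
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

# Hypothetical scenarios: same 5% threshold, different priors.
for prior in (0.5, 0.1, 0.01):
    ppv = positive_predictive_value(prior, power=0.8, alpha=0.05)
    print(f"prior {prior:>4}: P(real | significant) = {ppv:.2f}")
```

Under these assumed inputs the threshold stays at 5% throughout, yet the share of significant findings that reflect a real effect depends heavily on how plausible the hypotheses were to begin with.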

Scientists need some basis for doing a test and getting excited about the result. By convention, that basis is "only 1 chance in 20 this happened randomly." Another way of saying this is: 5% of the time, this system would have produced this result (and 95% of the time it behaves in the less-exciting, normal way). Now, it is critical that scientists be honest and not attempt the same experiment 20 times, then report just the one that worked. An excellent xkcd comic satirizes this behavior. When a "scientist" starts picking and choosing the experimental data until a significant result is achieved, this is ex-post selection, or p-hacking. A truthful scientist only performs ex-ante analysis, meaning they choose what to test before knowing any results.
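
The arithmetic behind the "20 attempts" warning is short enough to check directly. This sketch assumes the attempts are independent and that there is no real effect at all; the function name is made up for illustration.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent null tests
    comes out 'significant' at threshold alpha."""
    return 1 - (1 - alpha) ** num_tests

for n in (1, 5, 20):
    print(f"{n:>2} attempts: {prob_at_least_one_false_positive(n):.1%}")
```

With 20 independent attempts, a spurious "significant" result is more likely than not, which is why the choice of what to test has to be made before looking at the results.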

So is there evidence that scientists are p-hacking? There was a bit of a furor when this plot of a million z-scores was published. It seemed to give the impression that scientists were nudging the data around to get a result that was just barely significant enough to report. But this is not necessarily the case.

As Alex Keil explains: "It looks bad at first glance...but the implied counterfactual literature with a normal distribution around 0 would only happen if we studied associations at random. Instead, scientists often opt to study things where there is prior evidence to suggest a real effect. You can get close to that figure from an innocuous literature with these rules: if effect is known to be small, rarely study it. If studying, study effect once. If you get significant effect, study more. Here's a small, simulated literature; looks bad, but it's perfect science: "

Simulated distribution of 100% honest literature

"Now lets implement p-hacking: imagine any researcher who gets a z-score close to the critical value (within 10%) can find some way to game statistics to get a significant result. That's a lot more like the figure from the paper. That version of p-hacking is, at worst, fudging a value of .078 to something that is significant at a=0.05. Is it incorrect and possibly unethical? Yes. Is it leading to a literature of mostly incorrect results? No. It leads to a plot that looks bad only when focused on p<0.05. The literature implied by the awful looking plot of Z-scores is consistent with a nearly flawless literature. We all know it's not flawless, but we should also reject the vacuous idea that this means the literature is largely incorrect. This underscores the point that obsessing around statistical significance means losing sight of vastly larger range of results that should be considered in a continuum, rather than as binary decisions about whether an effect exists or not."

Simulated distribution of 90% honest literature
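
The "within 10%" nudging rule can be sketched in a few lines. This is not Keil's simulation: the spread of true effects, the study count, the size of the nudge, and all names below are assumptions chosen only to show how the rule empties the band just under the critical value.

```python
import math
import random

CRITICAL_Z = 1.96   # two-sided 5% threshold
NUDGE_BAND = 0.10   # "within 10%" of the critical value

def is_just_short(z):
    """True if |z| falls in the band just below the critical value."""
    return CRITICAL_Z * (1 - NUDGE_BAND) <= abs(z) < CRITICAL_Z

def simulate_literature(n_studies, hack, seed=0):
    """Draw one z-score per study; optionally nudge near-misses."""
    rng = random.Random(seed)
    z_scores = []
    for _ in range(n_studies):
        # ASSUMPTION: true effects (in z units) are spread around zero.
        true_effect = rng.gauss(0.0, 2.0)
        z = rng.gauss(true_effect, 1.0)  # observed z = truth + noise
        if hack and is_just_short(z):
            # Push the near-miss just past the threshold, keeping its sign.
            z = math.copysign(CRITICAL_Z + 0.01, z)
        z_scores.append(z)
    return z_scores

def count_significant(z_scores):
    return sum(abs(z) >= CRITICAL_Z for z in z_scores)

honest = simulate_literature(20_000, hack=False)
hacked = simulate_literature(20_000, hack=True)

print("near-misses, honest:", sum(is_just_short(z) for z in honest))
print("near-misses, hacked:", sum(is_just_short(z) for z in hacked))
print("significant, honest:", count_significant(honest))
print("significant, hacked:", count_significant(hacked))
```

Because both runs share a seed, the only studies that change are the near-misses, so the extra significant results in the hacked run equal the number of honest near-misses.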

increase sample size

An unintuitive concept that must be acquired early: larger sample sizes increase confidence in a test, even if the larger sample includes more data points that contradict the effect. Kahneman and Tversky nullified a decade of psychological research by showing that small effects in reading, comprehension, or cognition were far more likely to be random noise than a real effect when studying 40-80 participants.

Vaccine (from William Feller's Introduction to Probability, page 150)

Suppose the normal rate of infection of a certain disease is 25%. Three new vaccines are tested but each is given to a different number of participants: 10, 17, and 23. In each case, the number of individuals sickened is recorded.

Now, for a completely useless vaccine, we expect an average of 2.5, 4.25, and 5.75 participants to become sickened, but sometimes it will be more, sometimes fewer. How often will none of ten become sick? About 5.6% of the time. Thus, we cannot even reach the 95% confidence level with only ten participants. At most one out of 17 being sickened is expected by chance only 5.0% of the time, so it is actually stronger evidence that the vaccine is working than 0 out of 10. For 23 vaccinated participants, let's say two still get sick. Is this third vaccine superior or inferior to the other two? It's actually superior: two or fewer sick out of 23 occurs only 4.9% of the time by random chance.

Note that these three probabilities (5.6%, 5.0%, 4.9%) are intentionally chosen to be close, to emphasize how increasing the sample size increases the significance even if the larger sample contains more contrary results. "If you don't like the result you got, you can always look for a smaller, noisier dataset."
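
These probabilities are binomial tail sums, and they can be recomputed with the standard library. The sketch below treats each participant as an independent trial with a 25% chance of infection (the "useless vaccine" case); the function name is made up for illustration.

```python
from math import comb

INFECTION_RATE = 0.25  # baseline rate if the vaccine does nothing

def prob_at_most(k, n, p):
    """P(at most k of n participants are sickened), binomial model."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# (number sickened, number vaccinated) for the three trials
for sick, n in [(0, 10), (1, 17), (2, 23)]:
    chance = prob_at_most(sick, n, INFECTION_RATE)
    print(f"at most {sick} sick out of {n}: {chance:.1%}")
```

Each result answers the question "how often would a useless vaccine look at least this good by luck alone?", which is why the tail sum includes every outcome with fewer sickened participants as well.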

don't assume data is normally distributed

As impressionable young students, we are introduced to the bell-curve probability distribution, which eventually will gain the more-advanced name: Gaussian distribution. We inadvertently acquire a tendency to assume anything producing a range of outcomes likely produces a Gaussian distribution (hereafter: 'normal' distribution).

This is terribly mistaken.

In fact, we would be safer to assume virtually nothing is normally distributed. The demise of a multi-billion dollar hedge fund was brought about by the assumption that large market movements were very rare (1 chance in 10²³ years). Another way to express this rarity is to say the event is 10 standard deviations from the average, or a "10 σ event." The fact that the markets have only been in operation for a couple hundred years and yet such a rare event was witnessed should arouse suspicion. Indeed, market movements are not actually normally distributed; they are power-law distributed. Another name for this type of distribution is a fat-tailed distribution. We might forgive the folks who lost billions of dollars at Long Term Capital Management for not noticing the difference between a normal and a logistic (or log) distribution. After all, they look very similar:

further reading on Wittgenstein's Ruler: Nassim Nicholas Taleb's book Skin In The Game

One of these is a normal distribution, one is the logistic distribution. Can you tell which is which? They are similar in appearance, but on each side of the peak the proximity of the tails to the x-axis cannot be discerned on a linear scale.

For further reading on distributions, see Ryan Moulton's blog: Logs, Tails, Long Tails

On a logarithmic y-axis, they look very different. Now consider the probability of a far-from-average event: the normal distribution (left) declines faster than linearly in log space. This means the event becomes less likely at an ever-increasing rate, faster than merely dividing by ten at each step. One step further from the mean reduces the probability by more than a factor of ten, whereas for a log distribution the reduction is "only" a factor of ten.
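
The difference in tail behavior can be checked numerically. The sketch below compares a standard normal with a logistic distribution rescaled to unit variance; that rescaling, the one-standard-deviation step size, and the function names are assumptions made here for a like-for-like comparison, so the per-step factors it prints will not match the round "factor of ten" used above.

```python
import math

# A logistic distribution with scale s has variance (s * pi)^2 / 3,
# so this scale gives it unit variance (an assumption for comparison).
LOGISTIC_SCALE = math.sqrt(3) / math.pi

def normal_tail(k):
    """P(X > k) for a standard normal variable."""
    return 0.5 * math.erfc(k / math.sqrt(2))

def logistic_tail(k):
    """P(X > k) for a zero-mean, unit-variance logistic variable."""
    return 1.0 / (1.0 + math.exp(k / LOGISTIC_SCALE))

for k in (2, 4, 6, 8, 10):
    print(f"{k:>2} sigma: normal {normal_tail(k):.2e}   "
          f"logistic {logistic_tail(k):.2e}")

# Shrinkage per extra sigma: roughly constant for the logistic tail,
# ever smaller for the normal tail.
for k in (8, 9):
    print(f"{k}->{k + 1} sigma: "
          f"normal x{normal_tail(k + 1) / normal_tail(k):.1e}, "
          f"logistic x{logistic_tail(k + 1) / logistic_tail(k):.1e}")
```

The two curves agree closely near the mean, which is why they look alike on a linear axis, but at ten standard deviations the assumed logistic tail is many orders of magnitude heavier.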

Thus, assuming market movements are normally-distributed is to assume very large single-day movements are essentially zero-probability. But what if Long Term Capital Management had assumed a log distribution in their risk model? The market event that caused their implosion had a likelihood of 1 in 203 years, not 1 in 10²³ years. Perhaps they would have taken measures to safeguard against the perilous event at this frequency?

It should never be assumed that a dataset is normally-distributed. Word usage frequency, war casualties, book sales, frequency of surname appearance, wealth distribution and solar flare intensity all follow log distributions (to name a few). The one exception is when a device is set to repeatedly measure the same thing, e.g., the mass of a gold coin (which isn't changing between measurements). In this case alone, one may safely assume normally-distributed data.

Figures from Clauset, Shalizi & Newman 2009 (link)

correlation ≠ causation

One of the most common techniques for demonstrating a relationship between two variables is the Pearson correlation coefficient r (often reported as r²). If one of those two variables is considered an input or cause of the other, one might cautiously conclude that there is a causal relationship. However, two common errors occur:

  1. Assuming a causal relationship between variables that are related, but not causal. For example, living near a park or garden raises kids' IQ (it doesn't.) Wet streets don't cause lightning; wet streets and lightning are related to the presence of thunderstorms.

  2. Assuming that a correlation coefficient of 0.5 or higher is significant. In fact, a correlation of 0.5 has only 6-14% of the information of a correlation coefficient of 0.7 (if 1 = complete information and 0 = no information).
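
The wet-streets example from the first error can be simulated to show how a shared cause manufactures a correlation. Everything numeric below (storm frequency, the chances of wet streets and lightning with and without a storm, the number of days) is a made-up assumption; the only structural assumption is that wet streets and lightning each depend on the storm and not on each other.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(42)
days = []
for _ in range(20_000):
    storm = rng.random() < 0.3                            # hypothetical rate
    wet = rng.random() < (0.9 if storm else 0.1)          # depends on storm only
    lightning = rng.random() < (0.7 if storm else 0.02)   # depends on storm only
    days.append((storm, int(wet), int(lightning)))

def correlation_for(selected_days):
    """Correlation between wet streets and lightning on the given days."""
    return pearson_r([w for _, w, _ in selected_days],
                     [l for _, _, l in selected_days])

r_all = correlation_for(days)
r_storm_days = correlation_for([d for d in days if d[0]])
r_clear_days = correlation_for([d for d in days if not d[0]])

print(f"all days:   r = {r_all:.2f}")
print(f"storm days: r = {r_storm_days:.2f}")
print(f"clear days: r = {r_clear_days:.2f}")
```

Across all days the two variables are clearly correlated, yet once the days are split by whether a storm occurred the correlation essentially vanishes, because neither variable influences the other.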

Perhaps the most-convincing explanation of correlation ≠ causation is to browse Tyler Vigen's collection of spurious correlations, all with reasonably high r².

Our goal in learning all of this?

To NOT be a Federico: