The widespread misunderstanding of randomness causes a lot of problems.
Today we’re going to explore a concept that causes a lot of human misjudgment. It’s called the bias from insensitivity to sample size, or, if you prefer,the law of small numbers.
Insensitivity to small sample sizes causes a lot of problems.
* * *
If I measured one person, who happened to measure 6 feet, and then told you that everyone in the whole world was 6 feet, you’d intuitively realize this is a mistake. You’d say, you can’t measure only one person and then draw such a conclusion. To do that you’d need a much larger sample.
And, of course, you’d be right.
While simple, this example is a key building block to our understanding of how insensitivity to sample size can lead us astray.
As Stuard Suterhland writes in Irrationality:
Before drawing conclusions from information about a limited number of events (a sample) selected from a much larger number of events (the population) it is important to understand something about the statistics of samples.
In Thinking, Fast and Slow, Daniel Kahneman writes “A random event, by definition, does not lend itself to explanation, but collections of random events do behave in a highly regular fashion.” Kahnemen continues, “extreme outcomes (both high and low) are more likely to be found in small than in large samples. This explanation is not causal.”
We all intuitively know that “the results of larger samples deserve more trust than smaller samples, and even people who are innocent of statistical knowledge have heard about this law of large numbers.”
The principle of regression to the mean says that as the sample size grows larger results should converge to a stable frequency. So, if we’re flipping coins, and measuring the proportion of times that we get heads, we’d expect it to approach 50% after some large sample size of, say, 100 but not necessarily 2 or 4.
In our minds, we often fail to account for the accuracy and uncertainty with a given sample size.
While we all understand it intuitively, it’s hard for us to realize in the moment of processing and decision making that larger samples are better representations than smaller samples.
We understand the difference between a sample size of 6 and 6,000,000 fairly well but we don’t, intuitively, understand the difference between 200 and 3,000.
* * *
This bias comes in many forms.
In a telephone poll of 300 seniors, 60% support the president.
If you had to summarize the message of this sentence in exactly three words, what would they be? Almost certainly you would choose “elderly support president.” These words provide the gist of the story. The omitted details of the poll, that it was done on the phone with a sample of 300, are of no interest in themselves; they provide background information that attracts little attention.” Of course, if the sample was extreme, say 6 people, you’d question it. Unless you’re fully mathematically equipped, however, you’ll intuitively judge the sample size and you may not react differently to a sample of, say, 150 and 3000. That, in a nutshell, is exactly the meaning of the statement that “people are not adequately sensitive to sample size.”
Part of the problem is that we focus on the story over reliability, or, robustness, of the results.
System one thinking, that is our intuition, is “not prone to doubt. It suppresses ambiguity and spontaneously constructs stories that are as coherent as possible. Unless the message is immediately negated, the associations that it evokes will spread as if the message were true.”
Considering sample size, unless it’s extreme, is not a part of our intuition.
The exaggerated faith in small samples is only one example of a more general illusion – we pay more attention to the content of messages than to information about their reliability, and as a result end up with a view of the world around us that is simpler and more coherent than the data justify. Jumping to conclusions is a safer sport in the world of our imagination than it is in reality.
* * *
In engineering, for example, we can encounter this in the evaluation of precedent.
Steven Vick, writing in Degrees of Belief: Subjective Probability and Engineering Judgment, writes:
If something has worked before, the presumption is that it will work again without fail. That is, the probability of future success conditional on past success is taken as 1.0. Accordingly, a structure that has survived an earthquake would be assumed capable of surviving with the same magnitude and distance, with the underlying presumption being that the operative causal factors must be the same. But the seismic ground motions are quite variable in their frequency content, attenuation characteristics, and many other factors, so that a precedent for a single earthquake represents a very small sample size.
Bayesian reasoning tells us that a single success, absent of other information, raises the likelihood of survival in the future.
In a way this is related to robustness. The more you’ve had to handle and you still survive the more robust you are.
Let’s look at some other examples.
* * *
Daniel Kahneman and Amos Tversky demonstrated our insensitivity to sample size with the following question:
A certain town is served by two hospitals. In the larger hospital about 45 babies are born each day, and in the smaller hospital about 15 babies are born each day. As you know, about 50% of all babies are boys. However, the exact percentage varies from day to day. Sometimes it may be higher than 50%, sometimes lower. For a period of 1 year, each hospital recorded the days on which more than 60% of the babies born were boys. Which hospital do you think recorded more such days?
- The larger hospital
- The smaller hospital
- About the same (that is, within 5% of each other)
Most people incorrectly choose 3. The correct answer is, however, 2.
In Judgment in Managerial Decision Making, Max Bazerman explains:
Most individuals choose 3, expecting the two hospitals to record a similar number of days on which 60 percent or more of the babies board are boys. People seem to have some basic idea of how unusual it is to have 60 percent of a random event occurring in a specific direction. However, statistics tells us that we are much more likely to observe 60 percent of male babies in a smaller sample than in a larger sample.” This effect is easy to understand. Think about which is more likely: getting more than 60 percent heads in three flips of coin or getting more than 60 percent heads in 3,000 flips.
* * *
Another interesting example comes from Poker.
Over short periods of time luck is more important than skill. The more luck contributes to the outcome, the larger the sample you’ll need to distinguish between someone’s skill and pure chance.
David Einhorn explains.
People ask me “Is poker luck?” and “Is investing luck?”
The answer is, not at all. But sample sizes matter. On any given day a good investor or a good poker player can lose money. Any stock investment can turn out to be a loser no matter how large the edge appears. Same for a poker hand. One poker tournament isn’t very different from a coin-flipping contest and neither is six months of investment results.
On that basis luck plays a role. But over time – over thousands of hands against a variety of players and over hundreds of investments in a variety of market environments – skill wins out.
As the number of hands played increases, skill plays a larger and larger role and luck plays less of a role.
* * *
But this goes way beyond hospitals and poker. Baseball is another good example. Over a long season, odds are the best teams will rise to the top. In the short term, anything can happen. If you look at the standing 10 games into the season, odds are they will not be representative of where things will land after the full 162 game season. In the short term, luck plays too much of a role.
In Moneyball, Michael Lewis writes “In a five-game series, the worst team in baseball will beat the best about 15% of the time.”
* * *
If you promote people or work with colleagues you’ll also want to keep this bias in mind.
If you assume that performance at work is some combination of skill and luck you can easily see that sample size is relevant to the reliability of performance.
That performance sampling works like anything else, the bigger the sample size the bigger the reduction in uncertainty and the more likely you are to make good decisions.
This has been studied by one of my favorite thinkers, James March. He calls it the false record effect.
False Record Effect. A group of managers of identical (moderate) ability will show considerable variation in their performance records in the short run. Some will be found at one end of the distribution and will be viewed as outstanding; others will be at the other end and will be viewed as ineffective. The longer a manager stays in a job, the less the probable difference between the observed record of performance and actual ability. Time on the job increased the expected sample of observations, reduced expected sampling error, and thus reduced the change that the manager (or moderate ability) will either be promoted or exit.
Hero Effect. Within a group of managers of varying abilities, the faster the rate of promotion, the less likely it is to be justified. Performance records are produced by a combination of underlying ability and sampling variation. Managers who have good records are more likely to have high ability than managers who have poor records, but the reliability of the differentiation is small when records are short.
(I realize promotions are a lot more complicated than I’m letting on. Some jobs, for example, are more difficult than others. It gets messy quickly and that’s part of the problem. Often when things get messy we turn off our brains and concoct the simplest explanation we can. Simple but wrong. I’m only pointing out that sample size is one input into the decision. I’m by no means advocating an “experience is best” approach, as that comes with a host of other problems.)
* * *
This bias is also used against you in advertising.
The next time you see a commercial that says “4 out of 5 Doctors recommend ….” These results are meaningless without knowing the sample size. Odds are pretty good that the sample size is 5.
* * *
Large sample sizes are not a panacea. Things change. Systems evolve and faith in those results can be unfounded as well.
The key, at all times, is to think.
This bias leads to a whole slew of things, such as:
– under-estimating risk
– over-estimating risk
– undue confidence in trends/patterns
– undue confidence in the lack of side-effects/problems
The Bias from insensitivity to sample size is part of the Farnam Street latticework of mental models.