Tag: False Record Effect

Mental Model: Bias from Insensitivity to Sample Size

The widespread misunderstanding of randomness causes a lot of problems.

Today we're going to explore a concept behind much of that misjudgment: the bias from insensitivity to sample size, or, if you prefer, the law of small numbers.

* * *

If I measured one person, who happened to be 6 feet tall, and then told you that everyone in the whole world was 6 feet tall, you'd intuitively realize this was a mistake. You'd say, you can't measure only one person and then draw such a conclusion. To do that, you'd need a much larger sample.

And, of course, you'd be right.

While simple, this example is a key building block for understanding how insensitivity to sample size can lead us astray.

As Stuart Sutherland writes in Irrationality:

Before drawing conclusions from information about a limited number of events (a sample) selected from a much larger number of events (the population) it is important to understand something about the statistics of samples.

In Thinking, Fast and Slow, Daniel Kahneman writes, “A random event, by definition, does not lend itself to explanation, but collections of random events do behave in a highly regular fashion.” Kahneman continues, “extreme outcomes (both high and low) are more likely to be found in small than in large samples. This explanation is not causal.”

We all intuitively know that “the results of larger samples deserve more trust than smaller samples, and even people who are innocent of statistical knowledge have heard about this law of large numbers.”

The law of large numbers says that as the sample size grows, results converge toward a stable frequency. So, if we're flipping coins and measuring the proportion of times we get heads, we'd expect the proportion to approach 50% over a large sample of, say, 100 flips, but not necessarily over 2 or 4.
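
A few lines of Python make this concrete (a minimal sketch, assuming a fair coin): tiny samples swing wildly, while large ones settle near 50%.

```python
import random

random.seed(1)  # reproducible illustration

def heads_proportion(n_flips: int) -> float:
    """Flip a fair coin n_flips times; return the share of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# Small samples can easily come out 0%, 50%, or 100% heads;
# large samples cluster tightly around 50%.
for n in (2, 4, 100, 10_000):
    print(f"{n:>6} flips: {heads_proportion(n):.1%} heads")
```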

In our minds, we often fail to account for the uncertainty that comes with a given sample size.

While we all understand this intuitively, it's hard for us to remember, in the moment of judgment and decision making, that larger samples represent the underlying population better than smaller ones.

We understand the difference between a sample size of 6 and 6,000,000 fairly well but we don't, intuitively, understand the difference between 200 and 3,000.

* * *

This bias comes in many forms.

In a telephone poll of 300 seniors, 60% support the president.

If you had to summarize the message of this sentence in exactly three words, what would they be? Almost certainly you would choose “elderly support president.” These words provide the gist of the story. The omitted details of the poll, that it was done on the phone with a sample of 300, are of no interest in themselves; they provide background information that attracts little attention. Of course, if the sample were extreme, say 6 people, you'd question it. Unless you're fully mathematically equipped, however, you'll judge the sample size intuitively, and you may not react differently to a sample of, say, 150 and one of 3,000. That, in a nutshell, is exactly the meaning of the statement that “people are not adequately sensitive to sample size.”
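
To see why a sample of 150 deserves a different reaction than one of 3,000, consider the normal-approximation margin of error for a sample proportion. The sketch below is illustrative; the 60% figure comes from the poll example above:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a sample proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# n=6 gives roughly ±39%, n=150 roughly ±7.8%, n=3000 roughly ±1.8%.
for n in (6, 150, 300, 3000):
    print(f"n = {n:>4}: 60% ± {margin_of_error(0.60, n):.1%}")
```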

Part of the problem is that we focus on the story over the reliability, or robustness, of the results.

System 1 thinking, that is, our intuition, is “not prone to doubt. It suppresses ambiguity and spontaneously constructs stories that are as coherent as possible. Unless the message is immediately negated, the associations that it evokes will spread as if the message were true.”

Considering sample size, unless it’s extreme, is not a part of our intuition.

Kahneman writes:

The exaggerated faith in small samples is only one example of a more general illusion – we pay more attention to the content of messages than to information about their reliability, and as a result end up with a view of the world around us that is simpler and more coherent than the data justify. Jumping to conclusions is a safer sport in the world of our imagination than it is in reality.

* * *

In engineering, for example, we can encounter this in the evaluation of precedent.

In Degrees of Belief: Subjective Probability and Engineering Judgment, Steven Vick writes:

If something has worked before, the presumption is that it will work again without fail. That is, the probability of future success conditional on past success is taken as 1.0. Accordingly, a structure that has survived an earthquake would be assumed capable of surviving another earthquake of the same magnitude and distance, with the underlying presumption being that the operative causal factors must be the same. But the seismic ground motions are quite variable in their frequency content, attenuation characteristics, and many other factors, so that a precedent of a single earthquake represents a very small sample size.

Bayesian reasoning tells us that a single success, absent other information, raises the likelihood of survival in the future.
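
One simple way to put a number on this, not Vick's own calculation but an illustration, is Laplace's rule of succession, which gives the posterior probability of success under a uniform prior:

```python
def posterior_success_prob(successes: int, trials: int) -> float:
    """Posterior mean probability of success under a uniform Beta(1, 1)
    prior, i.e. Laplace's rule of succession: (s + 1) / (n + 2)."""
    return (successes + 1) / (trials + 2)

# A structure that has survived one earthquake: the estimate rises above
# the 50% prior mean, but nowhere near the 1.0 that naive precedent assumes.
print(posterior_success_prob(1, 1))    # ~0.67 after a single success
print(posterior_success_prob(10, 10))  # ~0.92 after ten, still not 1.0
```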

In a way, this is related to robustness: the more you've had to handle and still survived, the more robust you've proven yourself to be.

Let’s look at some other examples.

* * *

Hospital

Daniel Kahneman and Amos Tversky demonstrated our insensitivity to sample size with the following question:

A certain town is served by two hospitals. In the larger hospital about 45 babies are born each day, and in the smaller hospital about 15 babies are born each day. As you know, about 50% of all babies are boys. However, the exact percentage varies from day to day. Sometimes it may be higher than 50%, sometimes lower. For a period of 1 year, each hospital recorded the days on which more than 60% of the babies born were boys. Which hospital do you think recorded more such days?

  1. The larger hospital
  2. The smaller hospital
  3. About the same (that is, within 5% of each other)

Most people incorrectly choose 3. The correct answer is, however, 2.

In Judgment in Managerial Decision Making, Max Bazerman explains:

Most individuals choose 3, expecting the two hospitals to record a similar number of days on which 60 percent or more of the babies born are boys. People seem to have some basic idea of how unusual it is to have 60 percent of a random event occurring in a specific direction. However, statistics tells us that we are much more likely to observe 60 percent of male babies in a smaller sample than in a larger sample. This effect is easy to understand. Think about which is more likely: getting more than 60 percent heads in three flips of a coin or getting more than 60 percent heads in 3,000 flips.
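
You can verify the hospital answer directly with the binomial distribution. This sketch computes the probability that strictly more than 60% of a day's births are boys at each hospital:

```python
from math import comb

def prob_more_than_60_boys(n_births: int, p_boy: float = 0.5) -> float:
    """Probability that strictly more than 60% of n_births are boys."""
    threshold = (3 * n_births) // 5  # 60% of the day's births
    return sum(
        comb(n_births, k) * p_boy**k * (1 - p_boy) ** (n_births - k)
        for k in range(threshold + 1, n_births + 1)
    )

print(f"Small hospital (15 births/day): {prob_more_than_60_boys(15):.1%}")
print(f"Large hospital (45 births/day): {prob_more_than_60_boys(45):.1%}")
```

The small hospital comes out at roughly 15% of days versus roughly 7% for the large one, which is why answer 2 is correct.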

* * *

Another interesting example comes from poker.

Over short periods of time luck is more important than skill. The more luck contributes to the outcome, the larger the sample you’ll need to distinguish between someone’s skill and pure chance.

David Einhorn explains:

People ask me “Is poker luck?” and “Is investing luck?”

The answer is, not at all. But sample sizes matter. On any given day a good investor or a good poker player can lose money. Any stock investment can turn out to be a loser no matter how large the edge appears. Same for a poker hand. One poker tournament isn’t very different from a coin-flipping contest and neither is six months of investment results.

On that basis luck plays a role. But over time – over thousands of hands against a variety of players and over hundreds of investments in a variety of market environments – skill wins out.

As the number of hands played increases, skill plays a larger and larger role and luck plays less of a role.
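
A small simulation shows Einhorn's point. The 2% per-hand edge below is purely an illustrative assumption: over a handful of hands the edge is invisible; over thousands, it dominates:

```python
import random

random.seed(7)

def share_of_winning_sessions(edge: float, n_hands: int,
                              n_trials: int = 2_000) -> float:
    """Fraction of simulated sessions in which a player who wins each
    hand with probability 0.5 + edge finishes ahead (wins > half)."""
    ahead = 0
    for _ in range(n_trials):
        wins = sum(random.random() < 0.5 + edge for _ in range(n_hands))
        ahead += wins > n_hands / 2
    return ahead / n_trials

# Over 10 hands the skilled player is often behind; over thousands of
# hands, the 2% edge wins out almost every time.
for n in (10, 100, 1_000, 5_000):
    print(f"{n:>5} hands: ahead in "
          f"{share_of_winning_sessions(0.02, n):.0%} of sessions")
```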

* * *

But this goes way beyond hospitals and poker. Baseball is another good example. Over a long season, odds are the best teams will rise to the top. In the short term, anything can happen. If you look at the standings 10 games into the season, odds are they will not be representative of where things will land after the full 162-game season. In the short term, luck plays too much of a role.

In Moneyball, Michael Lewis writes “In a five-game series, the worst team in baseball will beat the best about 15% of the time.”
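
Lewis's figure is easy to sanity-check with a binomial model. The 29% per-game win probability below is a back-of-envelope assumption chosen for illustration, not a number from Moneyball:

```python
from math import comb

def underdog_series_prob(p_game: float, wins_needed: int = 3) -> float:
    """Probability of winning a best-of-5 series given a per-game win
    probability; equivalent to winning at least 3 of 5 games."""
    n_games = 2 * wins_needed - 1
    return sum(
        comb(n_games, k) * p_game**k * (1 - p_game) ** (n_games - k)
        for k in range(wins_needed, n_games + 1)
    )

# Hypothetical: a team that wins ~29% of single games against a
# superior opponent still takes a five-game series about 15% of the time.
print(f"{underdog_series_prob(0.29):.1%}")
```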

* * *

If you promote people or work with colleagues you’ll also want to keep this bias in mind.

If you assume that performance at work is some combination of skill and luck you can easily see that sample size is relevant to the reliability of performance.

Performance sampling works like anything else: the bigger the sample size, the greater the reduction in uncertainty, and the more likely you are to make good decisions.

This has been studied by one of my favorite thinkers, James March. He calls it the false record effect.

He writes:

False Record Effect. A group of managers of identical (moderate) ability will show considerable variation in their performance records in the short run. Some will be found at one end of the distribution and will be viewed as outstanding; others will be at the other end and will be viewed as ineffective. The longer a manager stays in a job, the less the probable difference between the observed record of performance and actual ability. Time on the job increased the expected sample of observations, reduced expected sampling error, and thus reduced the chance that the manager (of moderate ability) will either be promoted or exit.

Hero Effect. Within a group of managers of varying abilities, the faster the rate of promotion, the less likely it is to be justified. Performance records are produced by a combination of underlying ability and sampling variation. Managers who have good records are more likely to have high ability than managers who have poor records, but the reliability of the differentiation is small when records are short.
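
The False Record Effect falls out of a few lines of simulation. In this sketch every manager has identical ability (a 50% chance of a good period, an illustrative assumption), yet short records spread them from outstanding to ineffective:

```python
import random

random.seed(42)

def records(n_managers: int, n_periods: int,
            ability: float = 0.5) -> list[float]:
    """Performance records for managers of identical ability: each
    period is a success with the same probability for everyone."""
    return [
        sum(random.random() < ability for _ in range(n_periods)) / n_periods
        for _ in range(n_managers)
    ]

# With only a few periods, identical managers typically span records
# from near 0% to near 100%; longer tenures collapse the spread.
for periods in (4, 20, 100):
    r = records(100, periods)
    print(f"{periods:>3} periods on the job: "
          f"best record {max(r):.0%}, worst {min(r):.0%}")
```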

(I realize promotions are a lot more complicated than I’m letting on. Some jobs, for example, are more difficult than others. It gets messy quickly and that's part of the problem. Often when things get messy we turn off our brains and concoct the simplest explanation we can. Simple but wrong. I’m only pointing out that sample size is one input into the decision. I’m by no means advocating an “experience is best” approach, as that comes with a host of other problems.)

* * *

This bias is also used against you in advertising.

The next time you see a commercial claiming “4 out of 5 doctors recommend ….”, remember that the result is meaningless without knowing the sample size. Odds are pretty good that the sample size is 5.
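
To see how weak that claim is at a sample size of 5: even if the true recommendation rate were a coin flip, five randomly chosen doctors would produce “4 out of 5 recommend” nearly a fifth of the time.

```python
from math import comb

# Probability that at least 4 of 5 doctors recommend a product
# even if the true recommendation rate is only 50%.
p = sum(comb(5, k) * 0.5**5 for k in (4, 5))
print(f"{p:.1%}")  # 18.8%
```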

* * *

Large sample sizes are not a panacea. Things change. Systems evolve, and faith in results drawn from the old system can be unfounded as well.

The key, at all times, is to think.

This bias leads to a whole slew of errors, such as:
– under-estimating risk
– over-estimating risk
– undue confidence in trends/patterns
– undue confidence in the lack of side-effects/problems

The bias from insensitivity to sample size is part of the Farnam Street latticework of mental models.

Promoting People In Organizations

Do you hire, fire, or promote people? Read on.

In their 1978 paper Performance Sampling in Social Matches, researchers March and March discussed the implications of performance sampling for understanding careers in organizations.

Considerable evidence exists documenting that individuals confronted with problems requiring the estimation of proportions act as though sample size were substantially irrelevant to the reliability of their estimates. We do this in hiring all the time. Yet we know that sample size matters.

On how this cognitive bias affects hiring, March and March describe three effects. The first two, the False Record Effect and the Hero Effect, are quoted above. The third:

Disappointment Effect

On the average, new managers will be a disappointment. The performance records by which managers are evaluated are subject to sampling error. Since a manager is promoted to a new job on the basis of a good previous record, the proportion of promoted managers whose past records are better than their abilities will be greater than the proportion whose past records are poorer. As a result, on the average, managers will do less well in their new jobs than they did in their old ones, and observers will come to believe that higher level jobs are more difficult than lower level ones, even if they are not.

…The present results reinforce the idea that indistinguishability among managers is a joint property of the individuals being evaluated and the process by which they are evaluated. Performance sampling models show how careers may be the consequences of erroneous interpretations of variations in performance produced by equivalent managers. But they also indicate that the same pattern of careers could be the consequence of unreliable evaluation of managers who do, in fact, differ, or of managers who do, in fact, learn over the course of their experience.
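
The Disappointment Effect is just as easy to reproduce. In this sketch the managers genuinely differ in ability (the ability range and record length are illustrative assumptions), yet the ones promoted on the best short records still disappoint on average:

```python
import random

random.seed(3)

N_MANAGERS, RECORD_LEN, N_PROMOTED = 200, 10, 10

def observed_record(ability: float, periods: int) -> float:
    """Share of 'good' periods in a noisy sample of the given length."""
    return sum(random.random() < ability for _ in range(periods)) / periods

# Illustrative spread of true abilities (chance of a good period).
abilities = [random.uniform(0.3, 0.7) for _ in range(N_MANAGERS)]
records = [observed_record(a, RECORD_LEN) for a in abilities]

# Promote those with the best short records, then watch them regress:
# their past records overstate their true ability, so on average their
# performance in the new job falls short of the record that got them there.
promoted = sorted(range(N_MANAGERS), key=records.__getitem__)[-N_PROMOTED:]
past = sum(records[i] for i in promoted) / N_PROMOTED
future = sum(observed_record(abilities[i], RECORD_LEN)
             for i in promoted) / N_PROMOTED
print(f"past record {past:.0%} vs. new-job performance {future:.0%}")
```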

But hold on a second before you stop promoting new managers (who, by definition, have a limited sample size).

I'm not sure that sample size is the right way to think about this.

Consider two people, Manager A and Manager B, who are up for promotion. Manager A has 10 years of experience and is an “all-star” (that is, great performance with little variation in observations). Manager B, on the other hand, has only 5 years of experience but has shown a lot of variance in performance.

If you had to promote one of them, you'd likely pick A. But it's important not to misinterpret the results of March and March, so let's dig a little deeper.

What if we add one more variable to our two managers?

Manager A's job has been “easy” whereas Manager B took a very “tough” assignment.

With this in mind, it seems reasonable to conclude that Manager B's variance in performance could be explained by the difficulty of their task. This could also explain the lack of variance in Manager A's performance.

Some jobs are tougher than others.

If you don't factor in degree of difficulty, you're missing something big, and you're sending a message to your workforce that discourages people from taking difficult assignments.