Tag: Robyn Dawes

Do Algorithms Beat Us at Complex Decision Making?

Algorithms are all the rage these days. AI researchers are taking more and more ground from humans in areas like rules-based games, visual recognition, and medical diagnosis. However, the idea that algorithms make better predictive decisions than humans in many fields is a very old one.

In 1954, the psychologist Paul Meehl published a controversial book with a boring-sounding name: Clinical vs. Statistical Prediction: A Theoretical Analysis and a Review of the Evidence.

The controversy? After reviewing the data, Meehl claimed that mechanical, data-driven algorithms could better predict human behavior than trained clinical psychologists — and with much simpler criteria. He was right.

The passing of time has not been friendly to humans in this game: Studies continue to show that algorithms do a better job than experts across a range of fields. In Thinking, Fast and Slow, Daniel Kahneman details a selection of fields in which human judgment has proven inferior to algorithms:

The range of predicted outcomes has expanded to cover medical variables such as the longevity of cancer patients, the length of hospital stays, the diagnosis of cardiac disease, and the susceptibility of babies to sudden infant death syndrome; economic measures such as the prospects of success for new businesses, the evaluation of credit risks by banks, and the future career satisfaction of workers; questions of interest to government agencies, including assessments of the suitability of foster parents, the odds of recidivism among juvenile offenders, and the likelihood of other forms of violent behavior; and miscellaneous outcomes such as the evaluation of scientific presentations, the winners of football games, and the future prices of Bordeaux wine.

The connection between them? Says Kahneman: “Each of these domains entails a significant degree of uncertainty and unpredictability.” He called them “low-validity environments”, and in those environments, simple algorithms matched or outplayed humans and their “complex” decision-making criteria, essentially every time.

***

A typical case is described in Michael Lewis' book on the relationship between Daniel Kahneman and Amos Tversky, The Undoing Project. He writes of work done at the Oregon Research Institute on radiologists and their x-ray diagnoses:

The Oregon researchers began by creating, as a starting point, a very simple algorithm, in which the likelihood that an ulcer was malignant depended on the seven factors doctors had mentioned, equally weighted. The researchers then asked the doctors to judge the probability of cancer in ninety-six different individual stomach ulcers, on a seven-point scale from “definitely malignant” to “definitely benign.” Without telling the doctors what they were up to, they showed them each ulcer twice, mixing up the duplicates randomly in the pile so the doctors wouldn't notice they were being asked to diagnose the exact same ulcer they had already diagnosed. […] The researchers' goal was to see if they could create an algorithm that would mimic the decision making of doctors.

This simple first attempt, [Lewis] Goldberg assumed, was just a starting point. The algorithm would need to become more complex; it would require more advanced mathematics. It would need to account for the subtleties of the doctors' thinking about the cues. For instance, if an ulcer was particularly big, it might lead them to reconsider the meaning of the other six cues.

But then UCLA sent back the analyzed data, and the story became unsettling. (Goldberg described the results as “generally terrifying”.) In the first place, the simple model that the researchers had created as their starting point for understanding how doctors rendered their diagnoses proved to be extremely good at predicting the doctors' diagnoses. The doctors might want to believe that their thought processes were subtle and complicated, but a simple model captured these perfectly well. That did not mean that their thinking was necessarily simple, only that it could be captured by a simple model.

More surprisingly, the doctors' diagnoses were all over the map: The experts didn't agree with each other. Even more surprisingly, when presented with duplicates of the same ulcer, every doctor had contradicted himself and rendered more than one diagnosis: These doctors apparently could not even agree with themselves.

[…]

If you wanted to know whether you had cancer or not, you were better off using the algorithm that the researchers had created than you were asking the radiologist to study the X-ray. The simple algorithm had outperformed not merely the group of doctors; it had outperformed even the single best doctor.
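The mechanical core of such a model is almost trivially simple to write down. Here is a minimal sketch in Python of an equal-weighted cue score; the cue names and codings are hypothetical stand-ins, not the study's actual seven factors or data:

```python
# Equal-weighted cue model, in the spirit of the Oregon study.
# Hypothetical cue names; each cue is coded 0 or 1 for illustration.

CUES = ["size", "crater_shape", "rim_contour", "location",
        "surrounding_mucosa", "filling_defect", "associated_mass"]

def equal_weight_score(ulcer):
    """'Improper linear model': every cue gets the same weight,
    so the score is just the sum of the coded cues."""
    return sum(ulcer[cue] for cue in CUES)

# Two presentations of the *same* ulcer, as in the duplicated-slide design.
first_pass = {cue: 1 if cue in {"size", "rim_contour"} else 0 for cue in CUES}
second_pass = dict(first_pass)

# Identical inputs always yield identical scores -- exactly the consistency
# the doctors turned out to lack.
assert equal_weight_score(first_pass) == equal_weight_score(second_pass)
print(equal_weight_score(first_pass))  # -> 2
```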

The fact that doctors (and psychiatrists, and wine experts, and so forth) cannot even agree with themselves is a problem called decision-making “noise”: Given the same set of data twice, we make two different decisions. Noise. Internal contradiction.

Algorithms win, at least partly, because they don't do this: The same inputs generate the same outputs every single time. They don't get distracted, they don't get bored, they don't get mad, they don't get annoyed. Basically, they don't have off days. And they don't fall prey to the litany of biases that humans do, like the representativeness heuristic.

The algorithm doesn't even have to be a complex one. As demonstrated above with radiology, simple rules work just as well as complex ones. Kahneman himself addresses this in Thinking, Fast and Slow when discussing Robyn Dawes's research on the superiority of simple algorithms using a few equally-weighted predictive variables:

The surprising success of equal-weighting schemes has an important practical implication: it is possible to develop useful algorithms without prior statistical research. Simple equally weighted formulas based on existing statistics or on common sense are often very good predictors of significant outcomes. In a memorable example, Dawes showed that marital stability is well predicted by a formula: Frequency of lovemaking minus frequency of quarrels.

You don't want your result to be a negative number.

The important conclusion from this research is that an algorithm that is constructed on the back of an envelope is often good enough to compete with an optimally weighted formula, and certainly good enough to outdo expert judgment. This logic can be applied in many domains, ranging from the selection of stocks by portfolio managers to the choices of medical treatments by doctors or patients.
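To make the back-of-the-envelope point concrete, here is a toy rendering of the marital-stability formula in Python. The formula is Dawes's; the numbers are invented:

```python
# Illustrative data only; the formula is simply lovemaking minus quarrels.
couples = {
    "couple_a": {"lovemaking_per_week": 3, "quarrels_per_week": 1},
    "couple_b": {"lovemaking_per_week": 1, "quarrels_per_week": 4},
}

for name, c in couples.items():
    score = c["lovemaking_per_week"] - c["quarrels_per_week"]
    outlook = "stable" if score > 0 else "in trouble"  # negative numbers are the bad sign
    print(f"{name}: {score:+d} -> {outlook}")
```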

Stock selection, certainly a “low-validity environment”, is an excellent example of the phenomenon.

As John Bogle pointed out to the world in the 1970s, and as time has only confirmed, the vast majority of human stock-pickers cannot outperform a simple S&P 500 index fund, an investment fund that operates on strict algorithmic rules about which companies to buy and sell, and in what quantities. The rules of the index aren't complex, and attempts to improve on them have met with less success than one might imagine.

***

Another interesting area where this holds is interviewing and hiring, a notoriously difficult “low-validity” environment. Even elite firms often don't do it that well, as has been well documented.

Fortunately, if we heed the advice of the psychologists, there are rules for operating in a low-validity environment that can work very well. In Thinking, Fast and Slow, Kahneman recommends fixing your hiring process by doing the following (or some close variant of it) in order to replicate the success of the algorithms:

Suppose you need to hire a sales representative for your firm. If you are serious about hiring the best possible person for the job, this is what you should do. First, select a few traits that are prerequisites for success in this position (technical proficiency, engaging personality, reliability, and so on). Don't overdo it — six dimensions is a good number. The traits you choose should be as independent as possible from each other, and you should feel that you can assess them reliably by asking a few factual questions. Next, make a list of questions for each trait and think about how you will score it, say on a 1-5 scale. You should have an idea of what you will call “very weak” or “very strong.”

These preparations should take you half an hour or so, a small investment that can make a significant difference in the quality of the people you hire. To avoid halo effects, you must collect the information on one trait at a time, scoring each before you move on to the next one. Do not skip around. To evaluate each candidate, add up the six scores. […] Firmly resolve that you will hire the candidate whose final score is the highest, even if there is another one whom you like better–try to resist your wish to invent broken legs to change the ranking. A vast amount of research offers a promise: you are much more likely to find the best candidate if you use this procedure than if you do what people normally do in such situations, which is to go into the interview unprepared and to make choices by an overall intuitive judgment such as “I looked into his eyes and liked what I saw.”
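Translated into a procedure, the recipe is very little code. The sketch below uses placeholder trait names and made-up scores; only the structure (roughly six traits, a 1-5 scale per trait, hire the highest total) follows Kahneman's description:

```python
# Sketch of the scoring discipline: fixed traits, 1-5 per trait, sum, commit.
# All trait names and numbers are invented for illustration.

TRAITS = ["technical_proficiency", "engaging_personality", "reliability",
          "communication", "work_ethic", "judgment"]

def total_score(scores):
    assert set(scores) == set(TRAITS), "score every trait, one at a time"
    assert all(1 <= s <= 5 for s in scores.values()), "use the 1-5 scale"
    return sum(scores.values())

candidates = {
    "candidate_x": dict(zip(TRAITS, [4, 3, 5, 4, 4, 3])),
    "candidate_y": dict(zip(TRAITS, [3, 5, 3, 4, 3, 4])),
}

# Resolve in advance to hire the top total, even if you "like" someone else better.
best = max(candidates, key=lambda name: total_score(candidates[name]))
print(best, total_score(candidates[best]))  # -> candidate_x 23
```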

In the battle of man vs. algorithm, unfortunately, man often loses. The promise of Artificial Intelligence is just that. So if we're going to be smart humans, we must learn to be humble in situations where our intuitive judgment simply is not as good as a set of simple rules.

Does Experience Make You an Expert?

In Experience and validity of clinical judgment: The illusory correlation, Robyn Dawes explores the relationship between experience and accuracy.

There is research about the relationship between experience and diagnostic and predictive accuracy, and about the validity of interviewing people to find out what they are like. Garb has recently summarized the research on experience and accuracy. There is no relationship between years of clinical experience and accuracy of judgment. A report of a task force of the American Psychological Association convened in the early 1980s noted that there was no evidence that professional competence is related to years of professional experience.

And yet we seek experienced people to be our teachers, executives, and political leaders.

Ben Franklin is often quoted as saying that “experience is the best teacher,” though the second clause of the saying reads “and fools will learn from no other.” Only Franklin didn't say “the best teacher”; he said “dear teacher,” with “dear” clearly intended to mean expensive.

The 10,000-hour rule, popularized by Malcolm Gladwell and based on Anders Ericsson’s study The Role of Deliberate Practice in the Acquisition of Expert Performance, states that in order to become an expert, one must accumulate 10,000 hours of deliberate practice. This claim has been widely disputed, including by Ericsson himself. There’s no question that practice is necessary for improvement, but 10,000 hours isn’t a magic number with universal application.

Something else to ponder: why is it that we often forget to account for the length of time that an expert has been out of practice in their field?

So does experience really make you an expert? What does it actually mean to be one? It turns out that, in many contexts, we don't learn from experience.

The analysis of what we learn and why we learn it, however, quickly yields sobriety about embracing generalizations about the effect of experience on learning across all contexts. For example, learning to sit in a chair, become a chess grandmaster, make a correct medical diagnosis, or avoid a war are quite different processes. The word “learning” is, of course, common to all, but close examination reveals that it means little more than that someone with no experience whatsoever could not accomplish any of these tasks.

Dawes illuminates this highly contrarian idea through quite unremarkable human behaviors like sitting in a chair and driving.

What then are the differences? First, consider sitting in a chair. It is a motor skill. It is done automatically. It does not involve any conscious hypotheses. It is clearly learned through early experience that provides immediate feedback about failure. Finally, it is not taught in the sense that one person conveys a verbal or mathematical description to another about how to do it. (In fact an amusing exercise is to attempt to write such a description, convince somebody else to follow your instructions explicitly—and then watch the person fail.) Driving a car has many similar characteristics. For example, steering it in a straight line is accomplished by very tiny discrete adjustments of the steering wheel that are not accomplished consciously (Ehrlich, 1966). (The “weaving” behavior of drunk drivers is often due to the impairment of these movements, rather than to any visual problem.) The skills needed to perform these slight movements are attained only through experience driving; in fact, most complete novices on the first driving lesson alternate between going toward the ditch and almost crossing the center line—much to the surprise and consternation of their novice teachers, who themselves may be unaware of their own “tremorous” movements of the steering wheel. As with sitting in a chair, explicit verbal instructions to someone else about exactly how to drive a car could result in disaster for the person who follows them rigidly.

Consider the curious thing that happened during the Paris Wine Tasting of 1976, alternatively known as the Judgement of Paris (its name was inspired by a story in Greek mythology). During this blind tasting competition, French wine experts judged ten different reds and ten different whites. Contrary to the strongly held belief that France produced the finest wines, it was the California wines that received the highest scores. Not only did the shocking results of the competition call into question the supposed superiority of French wine, they also gave people reason to wonder what authority an expert really has over a casual wine drinker.

Abstract of Experience and validity of clinical judgment: The illusory correlation

Mental health experts often justify diagnostic and predictive judgments on the basis of “years of experience” with a particular type of person. Justification by experience is common in legal settings, and can have profound consequences for the person about whom such judgments are made. However, research shows that the validity of clinical judgment and amount of clinical experience are unrelated. The role of experience in learning varies as a function of what is to be learned. Experiments show that learning conceptual categories depends upon: (1) the learner's having clear hypotheses about the possible rule for category membership prior to receiving feedback about which instances belong to the category, and, (2) the systematic nature of such feedback, especially about erroneous categorizations. Since neither of these conditions is satisfied in clinical contexts in psychology, the subsequent failure of experience per se to enhance diagnostic or predictive validity is unsurprising. Claims that “I can tell on the basis of my experience with people of a particular type (e.g., child abusers) that this person is of that type (e.g., a child abuser)” are simply invalid.

***

Still curious? Try reading The Ambiguities of Experience. If you want to learn more about Dawes, check out his book Everyday Irrationality: How Pseudo-Scientists, Lunatics, And The Rest Of Us Systematically Fail To Think Rationally.

Social Dilemmas

Social dilemmas arise when an individual receives a higher payoff for defecting than for cooperating while everyone else cooperates, yet when everyone defects, all are worse off. That is, each member has a clear and unambiguous incentive to make a choice that, if made by all members, produces a worse outcome for everyone.

A great example of a social dilemma is to imagine yourself out with a group of your friends for dinner. Before the meal, you all agree to share the cost equally. Looking at the menu, you see a lot of items that appeal to you but are outside your budget.

Pondering this, you realize that you're only on the hook for 1/(number of friends at the dinner) of the bill. Now you can enjoy yourself without having to pay the full cost.

But what if everyone at the table realized the same thing? My guess is you'd all be stunned by the bill. This is, in essence, the tragedy of the commons.
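The arithmetic behind the temptation is easy to sketch. With invented menu prices, and assuming a bill split evenly among six diners:

```python
# Back-of-the-envelope payoffs for the shared-meal dilemma (illustrative numbers).
N_DINERS = 6
MODEST, LAVISH = 20.0, 50.0   # invented menu prices

def my_share(my_order, others_orders):
    """What I actually pay when the total bill is split evenly."""
    return (my_order + sum(others_orders)) / N_DINERS

# If everyone else orders modestly, defecting is cheap for me:
print(my_share(LAVISH, [MODEST] * 5))   # 25.0 -- I get 50 worth of food for 25
print(my_share(MODEST, [MODEST] * 5))   # 20.0

# But if everyone reasons the same way, we all pay full price for lavish meals:
print(my_share(LAVISH, [LAVISH] * 5))   # 50.0 -- everyone worse off than at 20
```

Each individual incentive points toward ordering lavishly, yet when everyone follows it, everyone pays more.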

This is a very simple example, but you can map it to the business world by thinking about healthcare and insurance.

If that sounds a lot like game theory, you're on the right track.

I came across an excellent paper by Robyn Dawes and David Messick, which takes a closer look at social dilemmas.

A Psychological Analysis of Social Dilemmas

In the case of the public good, one strategy that has been employed is to create a moral sense of duty to support it—for instance, the public television station that one watches. The attempt is to reframe the decision as doing one's duty rather than making a difference—again, in the wellbeing of the station watched. The injection of a moral element changes the calculation from “Will I make a difference” to “I must pay for the benefit I get.”

The final illustration, the shared meal and its more serious counterparts, requires yet another approach. Here there is no hierarchy, as in the organizational example, that can be relied upon to solve the problem. With the shared meal, all the diners need to be aware of the temptation that they have and there need to be mutually agreed-upon limits to constrain the diners. Alternatively, the rule needs to be changed so that everyone pays for what they ordered. The latter arrangement creates responsibility in that all know that they will pay for what they order. Such voluntary arrangements may be difficult to arrange in some cases. With the medical insurance, the insurance company may recognize the risk and insist on a principle of co-payments for medical services. This is a step in the direction of paying for one's own meal, but it allows part of the “meal” to be shared and part of it to be paid for by the one who ordered it.

The fishing version is more difficult. To make those harvesting the fish pay for some of the costs of the catch would require some sort of taxation to deter the unbridled exploitation of the fishery. Taxation, however, leads to tax avoidance or evasion. But those who harvest the fish would have no incentive to report their catches accurately or at all, especially if they were particularly successful, which simultaneously means particularly successful—compared to others at least—in contributing to the problem of a subsequently reduced yield. Voluntary self-restraint would be punished as those with less of that personal quality would thrive while those with more would suffer. Conscience, as Hardin (1968) noted, would be self-eliminating. …

Relatively minor changes in the social environment can induce major changes in decision making because these minor changes can change the perceived appropriateness of a situation. One variable that has been shown to make such a difference is whether the decision maker sees herself as an individual or as a part of a group.