
Do Algorithms Beat Us at Complex Decision Making?

Algorithms are all the rage these days. AI systems are taking more and more ground from humans in areas like rules-based games, visual recognition, and medical diagnosis. However, the idea that algorithms make better predictive decisions than humans in many fields is a very old one.

In 1954, the psychologist Paul Meehl published a controversial book with a boring-sounding name: Clinical vs. Statistical Prediction: A Theoretical Analysis and a Review of the Evidence.

The controversy? After reviewing the data, Meehl claimed that mechanical, data-driven algorithms could better predict human behavior than trained clinical psychologists — and with much simpler criteria. He was right.

The passing of time has not been friendly to humans in this game: Studies continue to show that algorithms do a better job than experts in a range of fields. In Thinking, Fast and Slow, Daniel Kahneman details a selection of fields in which human judgment has proved inferior to algorithms:

The range of predicted outcomes has expanded to cover medical variables such as the longevity of cancer patients, the length of hospital stays, the diagnosis of cardiac disease, and the susceptibility of babies to sudden infant death syndrome; economic measures such as the prospects of success for new businesses, the evaluation of credit risks by banks, and the future career satisfaction of workers; questions of interest to government agencies, including assessments of the suitability of foster parents, the odds of recidivism among juvenile offenders, and the likelihood of other forms of violent behavior; and miscellaneous outcomes such as the evaluation of scientific presentations, the winners of football games, and the future prices of Bordeaux wine.

The connection between them? Says Kahneman: “Each of these domains entails a significant degree of uncertainty and unpredictability.” He called them “low-validity environments”, and in those environments, simple algorithms matched or outplayed humans and their “complex” decision making criteria, essentially every time.

***

A typical case is described in Michael Lewis' book on the relationship between Daniel Kahneman and Amos Tversky, The Undoing Project. He writes of work done at the Oregon Research Institute on radiologists and their X-ray diagnoses:

The Oregon researchers began by creating, as a starting point, a very simple algorithm, in which the likelihood that an ulcer was malignant depended on the seven factors doctors had mentioned, equally weighted. The researchers then asked the doctors to judge the probability of cancer in ninety-six different individual stomach ulcers, on a seven-point scale from “definitely malignant” to “definitely benign.” Without telling the doctors what they were up to, they showed them each ulcer twice, mixing up the duplicates randomly in the pile so the doctors wouldn't notice they were being asked to diagnose the exact same ulcer they had already diagnosed. […] The researchers' goal was to see if they could create an algorithm that would mimic the decision making of doctors.

This simple first attempt, [Lewis] Goldberg assumed, was just a starting point. The algorithm would need to become more complex; it would require more advanced mathematics. It would need to account for the subtleties of the doctors' thinking about the cues. For instance, if an ulcer was particularly big, it might lead them to reconsider the meaning of the other six cues.

But then UCLA sent back the analyzed data, and the story became unsettling. (Goldberg described the results as “generally terrifying”.) In the first place, the simple model that the researchers had created as their starting point for understanding how doctors rendered their diagnoses proved to be extremely good at predicting the doctors' diagnoses. The doctors might want to believe that their thought processes were subtle and complicated, but a simple model captured these perfectly well. That did not mean that their thinking was necessarily simple, only that it could be captured by a simple model.

More surprisingly, the doctors' diagnoses were all over the map: The experts didn't agree with each other. Even more surprisingly, when presented with duplicates of the same ulcer, every doctor had contradicted himself and rendered more than one diagnosis: These doctors apparently could not even agree with themselves.

[…]

If you wanted to know whether you had cancer or not, you were better off using the algorithm that the researchers had created than you were asking the radiologist to study the X-ray. The simple algorithm had outperformed not merely the group of doctors; it had outperformed even the single best doctor.

The fact that doctors (and psychiatrists, and wine experts, and so forth) cannot even agree with themselves is a problem called decision making “noise”: Given the same set of data twice, we make two different decisions. Noise. Internal contradiction.

Algorithms win, at least partly, because they don't do this: The same inputs generate the same outputs every single time. They don't get distracted, they don't get bored, they don't get mad, they don't get annoyed. Basically, they don't have off days. And they don't fall prey to the litany of biases that humans do, like the representativeness heuristic.
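To make the idea of noise concrete, here is a minimal Python sketch. It assumes a judge has rated the same cases twice, as in the duplicate-ulcer setup described above, and counts how often the two ratings differ. The ratings, the cues, and the function names are invented for illustration; they are not data or methods from the Oregon study.

```python
from typing import Sequence


def self_disagreement_rate(first_pass: Sequence[int], second_pass: Sequence[int]) -> float:
    """Fraction of cases where the same judge gave two different ratings."""
    if len(first_pass) != len(second_pass):
        raise ValueError("Both passes must cover the same cases")
    if not first_pass:
        raise ValueError("Need at least one case")
    changed = sum(1 for a, b in zip(first_pass, second_pass) if a != b)
    return changed / len(first_pass)


def fixed_rule(cues: Sequence[int]) -> int:
    """A deterministic rule: the same cues always produce the same score."""
    return sum(cues)


# Hypothetical seven-point ratings for five cases, each shown to the judge twice.
doctor_first = [2, 5, 7, 3, 4]
doctor_second = [3, 5, 6, 3, 4]
human_noise = self_disagreement_rate(doctor_first, doctor_second)

# The rule is shown the same hypothetical cues twice as well.
cases = [[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 0], [1, 1, 0]]
rule_first = [fixed_rule(c) for c in cases]
rule_second = [fixed_rule(c) for c in cases]
rule_noise = self_disagreement_rate(rule_first, rule_second)

print(f"doctor self-disagreement: {human_noise:.0%}")
print(f"rule self-disagreement: {rule_noise:.0%}")
```

With these made-up numbers, the doctor changes his mind on two of the five cases, while the rule never does. That is the whole point: a rule can be wrong, but it cannot be inconsistent with itself.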

The algorithm doesn't even have to be a complex one. As demonstrated above with radiology, simple rules work just as well as complex ones. Kahneman himself addresses this in Thinking, Fast and Slow when discussing Robyn Dawes's research on the superiority of simple algorithms using a few equally-weighted predictive variables:

The surprising success of equal-weighting schemes has an important practical implication: it is possible to develop useful algorithms without prior statistical research. Simple equally weighted formulas based on existing statistics or on common sense are often very good predictors of significant outcomes. In a memorable example, Dawes showed that marital stability is well predicted by a formula: Frequency of lovemaking minus frequency of quarrels.

You don't want your result to be a negative number.

The important conclusion from this research is that an algorithm that is constructed on the back of an envelope is often good enough to compete with an optimally weighted formula, and certainly good enough to outdo expert judgment. This logic can be applied in many domains, ranging from the selection of stocks by portfolio managers to the choices of medical treatments by doctors or patients.
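For the curious, an equal-weighting scheme of the kind described here can be sketched in a few lines of Python. This is one common way to build such a formula, assuming each cue is first put on a common scale (z-scores) and then simply added or subtracted; it is not necessarily the exact procedure Dawes used. The couples and their numbers are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def standardize(values: List[float]) -> List[float]:
    """Convert raw values to z-scores so every cue is on the same scale."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A cue that never varies carries no information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def equal_weight_scores(cues: Dict[str, List[float]], signs: Dict[str, int]) -> List[float]:
    """Sum standardized cues, each with a weight of +1 or -1 and nothing fancier."""
    n = len(next(iter(cues.values())))
    totals = [0.0] * n
    for name, values in cues.items():
        if len(values) != n:
            raise ValueError(f"Cue {name!r} has the wrong number of cases")
        for i, z in enumerate(standardize(values)):
            totals[i] += signs[name] * z
    return totals


def marital_score(lovemaking: float, quarrels: float) -> float:
    """The formula quoted above: frequency of lovemaking minus frequency of quarrels."""
    return lovemaking - quarrels


# Hypothetical monthly frequencies for three couples.
cues = {"lovemaking": [8, 4, 2], "quarrels": [2, 4, 6]}
signs = {"lovemaking": +1, "quarrels": -1}
scores = equal_weight_scores(cues, signs)

print(scores)
print(marital_score(8, 2), marital_score(2, 6))
```

Notice that nothing here was fitted to data. The only judgment calls are which cues to include and whether each one counts for or against the outcome.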

Stock selection, certainly a “low-validity environment”, is an excellent example of the phenomenon.

As John Bogle pointed out to the world in the 1970s, and the point has only strengthened with time, the vast majority of human stock-pickers cannot outperform a simple S&P 500 index fund, an investment fund that operates on strict algorithmic rules about which companies to buy and sell and in what quantities. The rules of the index aren't complex, and many people have tried to improve on them with less success than might be imagined.

***

Another interesting area where this holds is interviewing and hiring, a notoriously difficult “low-validity” environment. Even elite firms often don't do it that well, as has been well documented.

Fortunately, if we take heed of the advice of the psychologists, there are rules for operating in a low-validity environment that can work very well. In Thinking, Fast and Slow, Kahneman recommends fixing your hiring process by doing the following (or some close variant), in order to replicate the success of the algorithms:

Suppose you need to hire a sales representative for your firm. If you are serious about hiring the best possible person for the job, this is what you should do. First, select a few traits that are prerequisites for success in this position (technical proficiency, engaging personality, reliability, and so on). Don't overdo it — six dimensions is a good number. The traits you choose should be as independent as possible from each other, and you should feel that you can assess them reliably by asking a few factual questions. Next, make a list of questions for each trait and think about how you will score it, say on a 1-5 scale. You should have an idea of what you will call “very weak” or “very strong.”

These preparations should take you half an hour or so, a small investment that can make a significant difference in the quality of the people you hire. To avoid halo effects, you must collect the information on one trait at a time, scoring each before you move on to the next one. Do not skip around. To evaluate each candidate, add up the six scores. […] Firmly resolve that you will hire the candidate whose final score is the highest, even if there is another one whom you like better–try to resist your wish to invent broken legs to change the ranking. A vast amount of research offers a promise: you are much more likely to find the best candidate if you use this procedure than if you do what people normally do in such situations, which is to go into the interview unprepared and to make choices by an overall intuitive judgment such as “I looked into his eyes and liked what I saw.”
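The procedure is simple enough to write down as code. The following Python sketch assumes six traits scored from 1 to 5 and picks the candidate with the highest total. Only the first three trait names come from the passage above; the other three, along with the candidates and their scores, are hypothetical placeholders.

```python
from typing import Dict

# The first three traits are named in the quoted passage; the last three are
# hypothetical stand-ins to bring the list to six.
TRAITS = [
    "technical_proficiency",
    "engaging_personality",
    "reliability",
    "communication",
    "persistence",
    "organization",
]


def total_score(scores: Dict[str, int]) -> int:
    """Add up the six trait scores, checking that each is on the 1-5 scale."""
    for trait in TRAITS:
        if trait not in scores:
            raise ValueError(f"Missing score for {trait}")
        if not 1 <= scores[trait] <= 5:
            raise ValueError(f"Score for {trait} must be between 1 and 5")
    return sum(scores[trait] for trait in TRAITS)


def pick_candidate(candidates: Dict[str, Dict[str, int]]) -> str:
    """Return the candidate with the highest total (first one listed wins a tie)."""
    return max(candidates, key=lambda name: total_score(candidates[name]))


# Hypothetical candidates and scores.
candidates = {
    "Avery": dict(zip(TRAITS, [4, 3, 5, 4, 3, 4])),
    "Blake": dict(zip(TRAITS, [5, 5, 2, 3, 3, 3])),
}

for name, scores in candidates.items():
    print(name, total_score(scores))
print("Hire:", pick_candidate(candidates))
```

In this made-up example Blake makes the stronger first impression on the two most visible traits, but Avery has the higher total and gets the offer. The code has no slot for "I liked what I saw," which is exactly the discipline the procedure is meant to impose.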

In the battle of man vs. algorithm, unfortunately, man often loses. The promise of artificial intelligence rests on exactly that. So if we're going to be smart humans, we must learn to be humble in situations where our intuitive judgment simply is not as good as a set of simple rules.