NOISE
A Flaw in Human Judgment
By Daniel Kahneman, Olivier Sibony and Cass R. Sunstein
A study of 1.5 million cases found that when judges pass down sentences on days following a loss by the local football team, they tend to be tougher than on days following a win. The study was consistent with a steady stream of reports, beginning in the 1970s, showing that sentencing decisions for the same crime varied dramatically — indeed scandalously — for an individual judge and also depending on which judge drew a particular case.
A study at an oncology center found that diagnostic accuracy for melanomas was only 64 percent, meaning that doctors missed the diagnosis in roughly one of every three lesions.
When two psychiatrists conducted independent reviews of 426 state hospital patients, the result was the equivalent of a tossup: they agreed only 50 percent of the time on what kind of mental illness was present.
When a large insurance company, concerned about quality control, asked its underwriters, who determine premium rates based on risk assessments, to come up with estimates for the same group of sample cases, their suggested premiums varied by an eye-popping median of 55 percent, meaning that one underwriter might have set a premium at $9,500 while a colleague set it at $16,700.
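The 55 percent figure is a median of relative differences between pairs of judgments. The arithmetic behind the example pair can be checked in a few lines (the helper name is mine, not the book's):

```python
def relative_difference(a: float, b: float) -> float:
    """Difference between two judgments as a fraction of their average."""
    return abs(a - b) / ((a + b) / 2)

# The example pair: one underwriter quotes $9,500, a colleague $16,700.
print(round(relative_difference(9_500, 16_700) * 100))  # → 55
```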
Doctors are more likely to order cancer screenings for patients they see early in the morning than late in the afternoon.
If employers rely on only one job interview to pick a candidate from among a similarly qualified group, the chances that this candidate will indeed perform better than the others are about 56 percent to 61 percent. That’s “somewhat better than flipping a coin, for sure,” the authors of “Noise” write, “but hardly a fail-safe way to make important decisions.”
In a study of the effectiveness of putting calorie counts on menu items, consumers were more likely to make lower-calorie choices if the labels were placed to the left of the food item rather than the right.
“When calories are on the left, consumers receive that information first and evidently think ‘a lot of calories!’ or ‘not so many calories!’ before they see the item,” Daniel Kahneman, Olivier Sibony and Cass R. Sunstein explain in this tour de force of scholarship and clear writing. “By contrast, when people see the food item first, they apparently think ‘delicious!’ or ‘not so great!’ before they see the calorie label. Here again, their initial reaction greatly affects their choices.” This hypothesis is supported, the authors write in a typically clever aside, by the “finding that for Hebrew speakers, who read right to left, the calorie label has a significantly larger impact if it is on the right rather than the left.”
These inconsistencies are all about noise, which Kahneman, Sibony and Sunstein define as “unwanted variability in judgments.”
Sometimes we treasure variability — in artistic tastes, political views or picking friends. But in many situations, we seek consistency: medicine, criminal justice, child custody decisions, economic forecasts, hiring, college admissions, fingerprint analysis or business choices about whether to greenlight a movie or consummate a merger.
Consistency equals fairness. If bias can be eliminated and sensible processes put in place, we should be able to arrive at the “right” result. Lack of consistency too often produces the wrong results because inconsistent judgment is frequently no better, the authors write, than the random judgments of “a dart-throwing chimpanzee.” And, of course, unexplained inconsistency undermines credibility and the systems in which those judgments are made.
As the authors explain in their introduction, a team of target shooters whose shots always fall to the right of the bull’s-eye is exhibiting a bias, as is a judge who always sentences Black people more harshly. That’s bad, but at least they are consistent, which means the biases can be identified and corrected. But another team whose shots are scattered in different directions away from the target is shooting noisily, and that’s harder to correct. A third team whose shots all go to the left of the bull’s-eye but are scattered high and low is both biased and noisy.
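The shooting analogy maps onto a standard statistical decomposition: bias is the average error, noise is the scatter around that average. A minimal sketch, with team names and numbers invented purely for illustration:

```python
import statistics

def bias_and_noise(errors):
    """Decompose a set of judgment errors: bias is the mean error,
    noise is the spread (standard deviation) around that mean."""
    bias = statistics.mean(errors)
    noise = statistics.pstdev(errors)
    return bias, noise

biased_team = [2.1, 1.9, 2.0, 2.2, 1.8]      # consistently right of target
noisy_team = [-2.0, 1.5, 2.5, -1.8, -0.2]    # scattered around the target
both_team = [-3.0, -1.0, -4.5, -2.0, -3.5]   # off-center and scattered

for name, shots in [("biased", biased_team), ("noisy", noisy_team), ("both", both_team)]:
    b, n = bias_and_noise(shots)
    print(f"{name}: bias={b:+.2f}, noise={n:.2f}")
```

A biased team shows a large mean error but a small spread; a noisy team shows the reverse, which is why noise is harder to spot and correct.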
Noise, despite its prominence in so many realms of human judgment, “is rarely recognized,” the authors note, let alone counteracted. Which is why the parade of noise examples that the authors provide is so compelling, and why gathering the examples in one place to demonstrate the cost of noise and then suggesting noise reduction techniques, or “decision hygiene,” makes this book so important. We are living in a moment of rampant polarization and distrust in the fundamental institutions that underpin civil society. Eradicating the noise that leads to random, unfair decisions will help us regain trust in one another.
“Noise” seems certain to make a mark by calling attention to the problem and providing a tangible guide to reducing it. Despite the authors’ intimidating academic credentials, they take pains to explain, even with welcome redundancy, their various categories of noise, the experiments and formulas that they introduce, as well as their conclusions and solutions.
Some decision hygiene is relatively easy. “Occasion noise” — the problem of a judge handing out stiffer sentences depending on whether a favorite sports team won or lost or whether it’s before or after lunch (yes, studies have found that, too) — can, like bias, be recognized during a “noise audit” and presumably dealt with. “System noise,” in which insurance adjusters, doctors, project planners or business strategists assess the same facts with that unfortunate variability, requires a more energetic decision hygiene.
However, as the authors point out, the steps of decision hygiene — like those of common hygiene, such as washing hands — “can be tedious. Their benefits are not directly visible; you might never know what problem they prevented from occurring.”
One example of effective decision hygiene has to do with the Apgar score, which assesses the overall health of newborns. Doctors score the baby on five criteria, ranging from the appearance of its skin to its heart rate, assigning zero to two points for each category. If the subscores add up to seven or higher, the baby is considered to be in good health.
“The Apgar score exemplifies how guidelines work and why they reduce noise,” the authors explain. “Unlike rules or algorithms, guidelines do not eliminate the need for judgment: The decision is not a straightforward computation. Disagreement remains possible on each of the components and hence on the final conclusion. Yet guidelines succeed in reducing noise because they decompose a complex decision into a number of easier subjudgments on predefined dimensions.”
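As a guideline, the Apgar score is just a sum of five bounded subjudgments checked against a threshold. A minimal sketch (the function names are mine; the five criteria and the threshold of seven are from the passage above):

```python
def apgar_total(appearance, pulse, grimace, activity, respiration):
    """Sum five independent subjudgments, each scored 0, 1 or 2."""
    scores = (appearance, pulse, grimace, activity, respiration)
    assert all(0 <= s <= 2 for s in scores), "each subscore must be 0, 1 or 2"
    return sum(scores)

def healthy(total: int) -> bool:
    """A summed score of seven or higher signals good health."""
    return total >= 7

print(healthy(apgar_total(2, 2, 1, 1, 1)))  # total 7 → True
```

Decomposing the decision this way is exactly what the authors describe: each subjudgment is easy and narrow, so individual disagreements perturb the total less than a single holistic impression would.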
Another compelling example of “decomposing” a decision involves a case study of a corporate merger decision. Rather than the bankers and executive team giving the company’s board the usual pro or con presentation, the C.E.O. first tasked various senior executives with assessing seven aspects of the merger, ranging from the talent of the team to be acquired to the possible financial benefits. Importantly, a separate team worked on each aspect, so that no team’s judgment was colored by positive or negative noise emanating from another’s verdict, avoiding the trap of what the authors call “excessive coherence.”
It’s also for that reason that none of four people interviewing a job candidate should know what their colleagues’ opinions are until they write down their own.
In other arenas, such as insurance underwriting, “Noise” does lean more toward establishing hard and fast rules and even using algorithms, which the authors assert should, in theory, “eliminate noise entirely.” However, they acknowledge that the way information gets entered into algorithms can itself be undermined by bias or noise.
The authors are sensitive to the costs of noise reduction, a point they illustrate in part with the story of the company that tangled itself up in an annual employee review process that included an overly complicated feedback questionnaire. Forty-six ratings on eleven dimensions for each rater and person being rated is just too much.
Similarly, the costs of eliminating noise have to be weighed. A fifth grader’s essay will be more fairly and accurately graded if five teachers read it independently using five or 10 criteria and averaging their assessments, instead of one teacher reading it and providing an overall impression. So will a high school senior’s college application. We can accept the noise in the fifth grader’s essay grade much more easily than we can accept it when deciding a college applicant’s fate.
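The statistical reason averaging helps is that independent noise shrinks with the square root of the number of judges: five graders cut the scatter by a factor of about 2.2. A quick simulation, with all numbers invented for illustration:

```python
import random
import statistics

random.seed(0)
TRUE_GRADE = 75   # the grade the essay "deserves"
NOISE_SD = 10     # each teacher's judgment scatters around the true grade
N_ESSAYS = 10_000

def grade():
    """One teacher's noisy judgment of the essay."""
    return random.gauss(TRUE_GRADE, NOISE_SD)

# One teacher per essay vs. the average of five independent teachers.
single = [grade() for _ in range(N_ESSAYS)]
averaged = [statistics.mean(grade() for _ in range(5)) for _ in range(N_ESSAYS)]

print(f"one teacher:   sd = {statistics.pstdev(single):.1f}")
print(f"five teachers: sd = {statistics.pstdev(averaged):.1f}")  # about 10 / sqrt(5), i.e. 4.5
```

The catch, as the review notes, is cost: quintupling the graders buys a 55 percent noise reduction, which is worth it for a college application but probably not for a fifth grader’s essay.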
Beyond bureaucracy and cost, there’s a loss of dignity when people are treated like numbers instead of individuals. There’s also the danger of forcing a rule — think of Jack Welch, the former C.E.O. of General Electric, who made it a set practice to fire a percentage of his lowest performers each year, even if many were still performing well. Forced ranking in this context, or in the case of an elite military unit, makes no sense, and relative scales and relative judgments would have made for better decision hygiene. In other situations, the opposite approach can create problems: rating everyone individually with no comparisons, such as the loosey-goosey standards that allow over 98 percent of the federal civil servant work force to be judged “fully successful.”
Thus, the authors cite the lawyer and author Philip Howard, who in books such as “The Death of Common Sense” has documented the dangers of bureaucracy, laws, rules and numerical ratings replacing human judgment in so many decisions.
Kahneman, Sibony and Sunstein also acknowledge the judicial backlash against the federal sentencing guidelines that took effect in 1987 and were meant to reduce the massive inconsistencies. Many judges fervently believed that these federal guidelines — and even more stringent ones legislated in many states — sidelined them from making the human judgments they were put on the bench to make. That continuing backlash, and the fact that prosecutors and judges learned to game the new rules, has been a key force behind recent criminal justice reform efforts.
The authors’ general argument, however, is that there is now so much noise that a major hygiene effort is in order across multiple disciplines. In too many arenas, they maintain persuasively, we’ve allowed too much noise at too high a cost.
The trick is finding the right balance, not looking for perfect fairness or accuracy, which will always be illusory. A rule that sets a birth date of Jan. 1 for entrance into kindergarten is going to be arbitrary and unfair to the child born at 11:59 the night before. (Although another rule will give her parents a bonus tax deduction because she was born in that earlier year.) But it’s a better way to choose who gets to start elementary school than interviewing every 4- to 6-year-old.
A digital body scan examined only by an algorithm might be an efficient way to check for melanoma, but I’d rather trust the terrific doctor who checks me every few months. Then again, I wouldn’t mind if he checked his conclusion against the algorithm.
“Noise” is about how our most important institutions can make decisions that are more fair, more accurate and more credible. That its prescriptions will not achieve perfect fairness and credibility, while creating pitfalls of their own, is no reason to turn away from this welcome handbook for making life’s lottery a lot more coherent.
Steven Brill, author of “Tailspin: The People and Forces Behind America’s Fifty-Year Fall — and Those Fighting to Reverse It,” is the co-C.E.O. of NewsGuard, which rates the reliability and trustworthiness of news websites.
454 pp. Little, Brown & Company. $32.