In 1931, a 25-year-old Austrian named Kurt Gödel defied logic, literally, with his incompleteness theorems, which proved that any consistent formal system powerful enough to express basic arithmetic contains true statements it cannot prove. In that sense, every such system eventually fails. What made this discovery even more devastating was that Gödel himself was a logician and his theorems took the form of a logical proof. Essentially, he used logic to kill logic.
Of course, the systems we rely on every day are not formal logical systems. They’re far more vulnerable. From flaws in the design of their hardware and software, to errors in how data is collected, analyzed and used to make decisions, real-world technologies are riddled with weaknesses.
Nevertheless, in the era of big data, we’ve given these systems enormous power over us and how we run our enterprises. Today, algorithms can help determine what college we attend, whether we get hired for a job and even who goes to prison and for how long. We have, seemingly unwittingly, outsourced judgment to machines and, unlike humans, we rarely question them.
Numbers that show up on a computer screen take on a special air of authority. Data are pulled in through massive databases and analyzed through complex analytics software. Eventually, they make their way to Excel workbooks, where they are massaged further into clear metrics for decision making.
But where does all that data come from? In many cases, from low-paid, poorly trained front-line employees recording it on clipboards as part of their daily drudgery. Data, as it’s been said, is the plural of anecdote and is subject to error. We can, and should, try to minimize these errors wherever possible, but we will likely never eliminate them entirely.
As MIT’s Zeynep Ton explains in The Good Jobs Strategy, which focuses on the retail industry, even the most powerful systems require human input and judgment. Cashiers need to ring up products with the right codes, personnel in the back room need to place items where they can be found and shelves need to be stocked with the right products.
Errors in any of these places can corrupt the data and cause tangible problems, like phantom stockouts, which in turn lead to poor decisions higher up in the organization, in areas like purchasing and marketing. These seemingly small mistakes can be incredibly pervasive. In fact, one study found that 65% of a retailer’s inventory data was inaccurate.
The truth is that no amount of complex tables and graphs can hide the fact that humans, with all of their faults, lie behind every system.
Firing Off Numbers
In Weapons of Math Destruction, data scientist Cathy O’Neil explains how data has become so pervasive in our lives that we hardly even notice it until it affects us directly. One application that has become particularly common is the use of algorithms to evaluate job performance.
For example, she tells the story of Sarah Wysocki, a teacher who, despite being widely respected by her students, their parents and her peers, was fired because she performed poorly according to an algorithm. Another teacher found himself in the 4th percentile one year only to score in the 96th percentile the next, although his methods never changed.
When an algorithm rates you poorly, you are branded as an underperformer and there is rarely an opportunity to appeal those judgments, or even to learn how they are made. In many cases, methods are considered “proprietary” and no details are shared. It’s hard to imagine a person’s judgment going so completely unquestioned, but data is often treated as unassailable.
In fact, the whole notion that school performance in America is declining is, at least in part, based on a data mistake. A Nation At Risk, the report issued during the Reagan Administration, set off the initial alarm bells about declining SAT scores. Yet if its authors had taken a closer look, they would have noticed that the scores in each subgroup were increasing.
This is a classic statistical trap known as Simpson’s paradox, in which a trend that holds in every subgroup reverses when the groups are combined. The reason for the decline in the average score was that more disadvantaged kids were taking the test, which is an encouraging sign. But because of data malpractice, teachers as a whole were judged to be failing.
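To see how every subgroup can improve while the overall average falls, here is a minimal sketch in Python. The subgroup names, headcounts and scores are invented purely for illustration and are not actual SAT data. The only thing that changes dramatically between the two years is how many students from the lower-scoring group take the test.

```python
# Hypothetical numbers, invented for illustration; NOT actual SAT data.
# Each entry maps a subgroup to (number of test takers, mean score).
year_1 = {"advantaged": (900, 520), "disadvantaged": (100, 400)}
year_2 = {"advantaged": (900, 525), "disadvantaged": (500, 410)}


def overall_mean(groups):
    """Mean across all test takers, weighting each subgroup by its size."""
    total_takers = sum(n for n, _ in groups.values())
    total_points = sum(n * mean for n, mean in groups.values())
    return total_points / total_takers


# Every subgroup's mean score goes up from year 1 to year 2...
for name in year_1:
    print(name, year_1[name][1], "->", year_2[name][1])

# ...yet the overall mean goes down, because the mix of test takers shifted.
print("overall", round(overall_mean(year_1), 1), "->", round(overall_mean(year_2), 1))
# overall 508.0 -> 483.9
```

Looking only at the last line, it would be easy to conclude that performance declined, even though nobody in this made-up example did worse.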
Overfitting The Past
Imagine we’re running a business that hires 100 people a year and we want to build a predictive model that would tell us what colleges we should focus our recruiting efforts on. A seemingly reasonable approach would be to examine where we’ve recruited people in the past and how they performed. Then we could focus recruiting from the best performing schools.
On the surface, that would seem to make sense, but if you take a closer look it is inherently flawed. First of all, 100 hires spread across perhaps a dozen colleges is far too small a sample to support statistically reliable conclusions. Second, it’s not hard to see how one or two standouts or dullards from a particular school would skew the results massively.
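A quick simulation makes the sample-size problem concrete. The sketch below assumes, hypothetically, that all twelve schools are exactly equally good, so any gap between the "best" and "worst" school is pure noise. The function name school_spread, the normal distribution and the seed are illustrative choices, not anything from a real recruiting system.

```python
import random
from statistics import mean


def school_spread(n_hires, n_schools=12, seed=0):
    """Gap between the best- and worst-looking school when every school is
    truly identical: each hire's performance is drawn from Normal(0, 1)."""
    rng = random.Random(seed)
    scores = {school: [] for school in range(n_schools)}
    for i in range(n_hires):
        # Assign hires to schools round-robin so each school gets a similar share.
        scores[i % n_schools].append(rng.gauss(0, 1))
    averages = [mean(values) for values in scores.values()]
    return max(averages) - min(averages)


# With 100 hires (8 or 9 per school), chance alone creates an apparent
# ranking of schools. With far more hires, the illusion mostly disappears.
print("gap with 100 hires:   ", round(school_spread(100), 2))
print("gap with 10,000 hires:", round(school_spread(10_000), 2))
```

In this setup the ranking built from 100 hires reflects nothing but luck, yet it would look like a clear signal about which schools to favor.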
A related problem is what statisticians call overfitting, which basically means that because there is an element of randomness in every data set, the more specifically we tailor a predictive model to the past the less likely it is to reflect the future. In other words, the more detailed we make our model to fit the data, the worse our predictions are likely to get.
That may seem counterintuitive, and it is, which is why overfitting is so common. People who sell predictive software love to be able to say things like, “our model has been proven to be 99.8% accurate,” even if that is often an indication that their product is actually less reliable than one that is, say, 80% accurate, but far simpler and more robust.
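The effect can be sketched with two toy models. The data-generating process below (a straight line plus random noise) is an assumption made for illustration, as are the helper names. One model simply memorizes the past, which makes it perfectly "accurate" on historical data. The other is a plain straight-line fit. Error is measured as mean squared error, so lower is better.

```python
import random

rng = random.Random(42)


def make_data(n):
    """Hypothetical process: y depends linearly on x, plus random noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 3)) for x in xs]


def fit_line(data):
    """Simple model: ordinary least squares with a single predictor."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x, _ in data))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_memorizer(data):
    """Overfit model: returns the outcome of the closest past example."""
    return lambda x: min(data, key=lambda point: abs(point[0] - x))[1]


def mse(model, data):
    """Mean squared error of a model's predictions on a data set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


past, future = make_data(200), make_data(200)
simple, memorizer = fit_line(past), fit_memorizer(past)

# The memorizer is flawless on the data it has already seen...
print("error on past:   simple", round(mse(simple, past), 1),
      "| memorizer", round(mse(memorizer, past), 1))
# ...but it has memorized the noise, so it does worse on new data.
print("error on future: simple", round(mse(simple, future), 1),
      "| memorizer", round(mse(memorizer, future), 1))
```

A vendor could truthfully advertise the memorizer's perfect record on historical data, which is exactly why a headline accuracy figure measured on the past says little about future reliability.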
Who’s Testing The Tests?
Wall Street is famous for its “quants,” highly paid mathematicians who build complex models to predict market movements and design trading strategies. These are really smart people who are betting millions and millions of dollars. Even so, it is not at all uncommon for their models to fail, sometimes disastrously.
The key difference between those models and many of the ones being peddled around these days is that Wall Street traders lose money when their data models go wrong. However, as O’Neil points out in her book, the effects of machine-driven judgments are not borne by those who design the algorithms, but by everyone else.
As we increasingly rely on machines to make decisions, we need to ask the same questions of them that we would ask of humans. What assumptions are inherent in your model? What hasn’t been taken into account? How are we going to test the effectiveness of the conclusions going forward?
Clearly, something has gone terribly awry. When machines replace human judgment, we should hold them to a high standard. We should know how the data was collected, how conclusions are arrived at and whether they actually improve things. And when numbers lie, we should stop listening to them. Anything less is data malpractice.
An earlier version of this article first appeared in Inc.com