Years later, what did we learn from Watson winning Jeopardy?

If you are a nerd like me, you might remember watching IBM’s Watson play Jeopardy against legendary champion Ken Jennings and Brad Rutter, another very strong player. Watson ate their lunch. At the time, it was perhaps the most powerful example of how AI was starting to change our ideas of what computers can do. In the eight years since, we have swung between two views: that AI is magic and that AI just doesn’t work. Both of those lessons were visible in that Jeopardy tournament.

Watson seemed brilliant most of the time and won the tournament. That astonished many, just as Deep Blue had shocked people years earlier by defeating Garry Kasparov at chess. Kasparov, like Ken Jennings, was no ordinary champion: he was a master among masters. So, we all remember the “AI is magic” lesson from Jeopardy.

Fewer remember the “AI doesn’t work” moment. In one Final Jeopardy round, this clue was displayed for the category “US Cities”:

Its largest airport is named for a World War II hero; its second largest, for a World War II battle.

Watson answered, “What is Toronto?”

This is what we in the AI business call a “howler.” It’s an answer so bad that not only is it wrong, but even if you don’t yourself know the right answer, you still know this answer is wrong. A howler is not even a good guess–obviously, Toronto is not a US city.

And this is what gives us humans pause about AI. To us, it seems to go completely off the rails sometimes and give a response that makes no sense to us. And that reduces our confidence in the system. If it gives an answer that dumb, why should we believe its other answers?

Now, you might say, “Who cares? It’s just a game!” And so it is. But what if this were a medical diagnosis? If the system gave such a wacky diagnosis once in a while, would you be willing to listen to it the rest of the time just because it is right more often than humans are? This is why some people still believe that “AI doesn’t work.”

What’s actually the case is that AI is neither magic nor ineffective. It’s just another technology that can be used to help us humans. That’s why the best implementations of AI do not replace humans but augment their judgement. See, Watson’s second highest-rated answer in its list was “What is Chicago?” which is the correct answer. If a human being had been using Watson to augment human judgement, rather than replace it, the right answer would have been delivered, because the human would have eliminated the Toronto answer and gone with Chicago.

But you might be interested to know why Watson gave that Toronto answer. I used to work with the folks in IBM Research who built Watson, so I got to ask them what happened. They explained to me that at first they used the categories to limit Watson’s answers. With that feature enabled, Watson would have thrown out Toronto and gone with Chicago. But they turned that feature off, because Watson was actually getting more right answers by not using the category to limit it. So many of Jeopardy’s categories are jokes and tortured puns that it was hard for Watson to get the limits right. What the IBM team failed to consider is that Jeopardy never uses jokes for the Final Jeopardy category. And that is how Watson blew that answer.
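To make that trade-off concrete, here is a minimal sketch in Python of what a category filter does to a ranked list of candidate answers. It is only an illustration, not Watson’s actual design: the `Candidate` class, the `is_us_city` flag, the `best_answer` function, and the confidence numbers are all assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    answer: str
    confidence: float  # illustrative score, not a real Watson output
    is_us_city: bool   # hypothetical stand-in for a real answer-type check


def best_answer(candidates, use_category_filter):
    """Return the top-scoring answer, optionally restricted by category."""
    pool = list(candidates)
    if use_category_filter:
        filtered = [c for c in pool if c.is_us_city]
        # Fall back to the full list if the filter removes everything,
        # which is the risk when a category is a pun and the filter misfires.
        if filtered:
            pool = filtered
    return max(pool, key=lambda c: c.confidence).answer


candidates = [
    Candidate("Toronto", 0.14, is_us_city=False),
    Candidate("Chicago", 0.11, is_us_city=True),
]

print(best_answer(candidates, use_category_filter=False))  # Toronto
print(best_answer(candidates, use_category_filter=True))   # Chicago
```

With the filter off, the highest score wins even though it contradicts the category. With the filter on, Chicago wins here, but the same filter would throw out good answers whenever it misread a joke category.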

It’s possible to design “human-in-the-loop” systems that avoid these problems. It is even possible to design systems that make fewer howlers, because a system can be set to optimize for avoiding egregious errors. That can raise our confidence in the system, even if it makes somewhat more mistakes overall, as long as those mistakes don’t seem outlandish.
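One way to picture both ideas is the short Python sketch below. It subtracts a penalty from any candidate that looks like a howler and flags low-confidence picks for a human to review. Everything in it is hypothetical: the `pick` function, the penalty, the review threshold, and the scores are assumptions chosen for illustration, not a description of how any real system is tuned.

```python
def pick(candidates, howler_penalty=0.5, review_threshold=0.5):
    """Choose an answer while penalizing outlandish candidates.

    candidates: list of (answer, confidence, looks_like_howler) tuples.
    Returns (answer, needs_human_review, ranked_answers).
    """
    def score(candidate):
        _, confidence, looks_like_howler = candidate
        # An outlandish answer costs more than an ordinary wrong one.
        return confidence - (howler_penalty if looks_like_howler else 0.0)

    ranked = sorted(candidates, key=score, reverse=True)
    answer, confidence, _ = ranked[0]
    # Low confidence means a person should make the final call,
    # with the full ranked list in front of them.
    needs_human_review = confidence < review_threshold
    return answer, needs_human_review, [c[0] for c in ranked]


answer, needs_review, ranked = pick([
    ("Toronto", 0.14, True),   # contradicts the "US Cities" category
    ("Chicago", 0.11, False),
])
print(answer, needs_review, ranked)
```

The penalty trades a little raw accuracy for fewer embarrassing answers, and the review flag hands the uncertain cases to a person instead of letting the machine guess.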

Today’s AI systems are working just that way, and you can understand exactly why if you watched Watson on Jeopardy.