At last week's American Crossword Puzzle Tournament, held as a virtual event with more than 1,000 participants, one impressive competitor made headlines. (And, despite my 143rd-place finish, it was unfortunately not me.) For the first time, artificial intelligence managed to outscore the human solvers in the race to fill in the grids with speed and accuracy. It was a triumph for Dr. Fill, a crossword-solving automaton that has been battling carbon-based cruciverbalists for nearly a decade.
To some observers, this may have seemed like just another area of human endeavor where AI now has the upper hand. Reporting on Dr. Fill’s achievement for Slate, Oliver Roeder wrote: “Checkers, backgammon, chess, Go, poker, and other games have witnessed the invasions of the machines, falling one by one into the hands of the dominant AIs. Now the crossword has joined them.” But a look at how Dr. Fill pulled off this feat reveals much more than the latest battle between humans and computers.
When IBM’s Watson supercomputer outplayed Ken Jennings and Brad Rutter on Jeopardy! a little over 10 years ago, Jennings responded, “I, for one, welcome our new computer overlords.” But Jennings was a little premature in throwing in the towel on behalf of humanity. Then as now, the latest advances in AI show not only the potential for computational understanding of natural language, but also its limits. And in Dr. Fill’s case, its performance tells us just as much about the mental arsenal humans bring to the particular linguistic challenge of solving a crossword, matching wits with the inventive souls who devise the puzzles. In fact, a closer look at how software tries to break down a devilish crossword clue provides new insight into what our own brains do when we play with language.
Dr. Fill was hatched by Matt Ginsberg, a computer scientist who is also a published crossword constructor. Since 2012, he has informally entered Dr. Fill in the ACPT, making incremental improvements to the solving software each year. This year, however, Ginsberg joined forces with the Berkeley Natural Language Processing Group, made up of graduate and undergraduate students supervised by Dan Klein, a professor at UC Berkeley.
Klein and his students began serious work on the project in February, then reached out to Ginsberg to see if they could combine their efforts for this year’s tournament. Just two weeks before the start of the ACPT, they hacked together a hybrid system in which the Berkeley group’s neural network methods for interpreting clues worked in tandem with Ginsberg’s code for efficiently filling out a crossword grid.
(Spoilers ahead for anyone interested in solving the ACPT puzzles later.)
The new and improved Dr. Fill fills the grid in a flurry of activity (you can see it in action here). But in reality, the program is deeply methodical: it analyzes a clue and establishes an initial ranked list of candidates for the answer, then narrows down the possibilities based on factors such as how well they fit with other answers. The correct answer may be buried deep in the candidate list, but enough context can allow it to percolate upward.
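To make that narrowing step concrete, here is a minimal sketch in Python. It is not Dr. Fill's actual code: the candidate words, their scores, and the helper name best_fill are all invented for illustration, and the real system handles a full grid rather than a single crossing. The sketch only shows how a lower-ranked candidate can win once a crossing entry is taken into account.

```python
from itertools import product

# Hypothetical ranked candidates for one across slot and one down slot.
# Each pair is (answer, clue-match score); the numbers are made up.
ACROSS = [("ARMIES", 0.5), ("ARRAYS", 0.3), ("HORDES", 0.2)]
DOWN = [("YEAR", 0.7), ("EARN", 0.2), ("OPEN", 0.1)]


def best_fill(across, down, across_idx, down_idx):
    """Return the (across, down, score) pair with the highest joint score.

    The two slots share one square: letter `across_idx` of the across
    answer must equal letter `down_idx` of the down answer. Pairs that
    disagree on that square are discarded.
    """
    best = None
    for (a, score_a), (d, score_d) in product(across, down):
        if a[across_idx] != d[down_idx]:
            continue  # the crossing letters clash, so this pair cannot fit
        joint = score_a * score_d
        if best is None or joint > best[2]:
            best = (a, d, joint)
    return best


# The fifth letter of the across answer is the first letter of the down answer.
result = best_fill(ACROSS, DOWN, across_idx=4, down_idx=0)
print(result[0], result[1])
```

On its own, ARMIES is the top-ranked across candidate in this toy data. Once the crossing entry is considered, the pairing of ARRAYS with YEAR scores higher than any pairing available to ARMIES, so the second-ranked answer rises to the top.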
Dr. Fill is trained on data gleaned from past crosswords that have appeared in various outlets. To solve a puzzle, the program refers to clues and answers that it has already “seen.” Like humans, Dr. Fill must draw on what it has learned in the past when facing a new challenge, looking for connections between new and old experiences. For example, the second puzzle of the competition, constructed by Wall Street Journal crossword editor Mike Shenk, relied on a theme in which long answers had the letters -ITY added to form fanciful new phrases, such as OPIUM DENS becoming OPIUM DENSITY (clued as “Factor in power of a poppy product?”). Dr. Fill was lucky, because despite the unusual phrases, a few of the answers had appeared in a crossword with the same theme published in 2010 in the Los Angeles Times, which Ginsberg has included in his database of more than 8 million clues and answers. But the tournament crossword’s clues were different enough that Dr. Fill was still challenged to come up with the correct answers. (OPIUM DENSITY, for example, was clued in 2010 as “Measuring drug trafficking in the neighborhood?”)
For all the answers, whether they are part of the puzzle’s theme or not, the program works through thousands of possibilities to generate candidates that best match the clues, ranking them by probability and checking them against the constraints of the grid, such as how across and down entries interlock. Sometimes the top candidate is the right one: For the clue “Imposing groups,” for example, Dr. Fill ranked the correct answer, ARRAYS, as its top choice. The word “imposing” had never appeared in previous clues for the word, but synonyms like “impressive” had, allowing Dr. Fill to infer the semantic connection.
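The kind of inference described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the Berkeley group's system uses neural network methods, whereas this toy swaps in a hand-written synonym table and simple word overlap, and the past clues, the SYNONYMS table, and the function names are all hypothetical.

```python
# Hypothetical stand-in for learned word similarity.
SYNONYMS = {"imposing": {"impressive", "grand"}}

# Hypothetical database of answers and clues "seen" in past puzzles.
PAST_CLUES = {
    "ARRAYS": ["Impressive groups", "Orderly arrangements"],
    "TROOPS": ["Scout groups"],
    "ARMIES": ["Large fighting forces"],
    "GANGS": ["Street groups"],
}


def tokens(clue):
    """Lowercase the clue and split it into a set of bare words."""
    return {word.strip('?.,!"').lower() for word in clue.split()}


def expand(words):
    """Add known synonyms so that related wordings can overlap."""
    expanded = set(words)
    for word in words:
        expanded |= SYNONYMS.get(word, set())
    return expanded


def similarity(new_clue, past_clue):
    """Jaccard overlap between the expanded new clue and a past clue."""
    a, b = expand(tokens(new_clue)), tokens(past_clue)
    return len(a & b) / len(a | b)


def rank(new_clue, length):
    """Rank answers of the right length by their best-matching past clue."""
    scored = [
        (answer, max(similarity(new_clue, past) for past in clues))
        for answer, clues in PAST_CLUES.items()
        if len(answer) == length
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank("Imposing groups", 6)
print(ranking[0][0])
```

Without the synonym table, ARRAYS and TROOPS would tie in this toy data, since each shares only the word “groups” with the new clue. Linking “imposing” to “impressive” is what breaks the tie in favor of ARRAYS.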