Deep learning sparked the latest AI revolution, transforming computer vision and the field as a whole. Hinton believes deep learning should be almost all you need to fully reproduce human intelligence.
But despite rapid progress, major challenges remain. Expose a neural network to an unfamiliar dataset or a foreign environment, and it proves brittle and inflexible. Self-driving cars and essay-writing language generators impress, but things can go wrong. Visual AI systems can be easily confused: a coffee cup recognized from the side would be an unknown from above if the system had not been trained on that view; and with the manipulation of a few pixels, a panda can be mistaken for an ostrich, or even a school bus.
GLOM addresses two of the most difficult problems for visual perception systems: understanding an entire scene in terms of objects and their natural parts; and recognizing objects when seen from a new point of view. (GLOM focuses on vision, but Hinton expects the idea could be applied to language as well.)
An object like Hinton’s face, for example, is made up of his keen but dog-tired eyes (too many people asking questions; too little sleep), his mouth and ears, and a prominent nose, all topped off by a predominantly gray ruffle of hair. And given his nose, he is easily recognizable even at first glance in profile view.
These two factors – the part-whole relationship and the point of view – are, from Hinton’s point of view, crucial to how humans do vision. “If GLOM ever works,” he says, “it will do perception in a much more human way than current neural networks.”
Grouping parts into wholes, however, can be a difficult problem for computers because the parts are sometimes ambiguous. A circle can be an eye, a donut, or a wheel. As Hinton explains, the first generation of AI vision systems attempted to recognize objects by relying primarily on the geometry of the part-whole relationship – the spatial orientation among the parts and between the parts and the whole. The second generation instead relied primarily on deep learning, letting the neural network train on large amounts of data. With GLOM, Hinton combines the best aspects of both approaches.
“There’s a certain intellectual humility that I love about it,” says Gary Marcus, founder and CEO of Robust.AI and a well-known critic of the heavy reliance on deep learning. Marcus admires Hinton’s willingness to challenge something that has brought him fame, to admit that it doesn’t quite work. “It’s brave,” he says. “And it’s a great corrective to say, ‘I’m trying to think outside the box.’”
In crafting GLOM, Hinton tried to model some of the mental shortcuts – intuitive or heuristic strategies – that people use to make sense of the world. “GLOM, and indeed a lot of Geoff’s work, is about looking at the heuristics people seem to have, building neural networks that might themselves have those heuristics, and then showing that the networks do better at vision,” says Nick Frosst, a computer scientist at a language startup in Toronto who worked with Hinton at Google Brain.
With visual perception, one strategy is to analyze parts of an object, such as different facial features, and thereby understand the whole. If you see a certain nose, you might recognize it as part of Hinton’s face; it is a part-whole hierarchy. To build a better vision system, Hinton says, “I have a strong hunch that we need to use part-whole hierarchies.” Human brains understand this part-whole composition by creating what is called an “analysis tree,” or parse tree – a branching diagram showing the hierarchical relationship between the whole, its parts, and its sub-parts. The face itself is at the top of the tree, and the component eyes, nose, ears, and mouth form the branches below.
One of Hinton’s main goals with GLOM is to replicate the analysis tree in a neural network – this would distinguish it from the neural networks that came before. For technical reasons, this is difficult to do. “It’s difficult because each individual image would be parsed by a person into a unique analysis tree, so we would want a neural network to do the same,” says Frosst. “It’s hard to get something with a static architecture – a neural network – to take on a new structure – an analysis tree – for every new image it sees.” Hinton has made various attempts. GLOM is a major overhaul of his previous attempt in 2017, combined with other related advances in the field.
“I am part of a nose!”
A generalized way of thinking about the GLOM architecture is as follows: The image of interest (for example, a photograph of Hinton’s face) is divided into a grid. Each region of the grid is a “location” on the image – one location may contain the iris of an eye, while another may contain the tip of his nose. For each location in the network, there are approximately five layers, or levels. And level by level, the system makes a prediction, with a vector representing the content or information. At a level near the bottom, the vector representing the tip-of-the-nose location might predict, “I’m part of a nose!” And at the next level up, by building a more coherent representation of what it sees, the vector might predict, “I’m part of a face in side-angle view!”
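To make the layout concrete, here is a minimal sketch of the data structure described above, not Hinton’s actual implementation. The function name `make_columns`, the vector size `DIM`, and the random starting values are assumptions made purely for illustration; only the grid of locations and the roughly five levels per location come from the description.

```python
import random

NUM_LEVELS = 5   # "approximately five" levels per location, per the description
DIM = 4          # hypothetical vector size, chosen only for illustration


def make_columns(rows, cols, num_levels=NUM_LEVELS, dim=DIM, seed=0):
    """Return one stack of level vectors for every grid location.

    Keys are (row, col) grid locations; each value is a list of
    `num_levels` vectors, ordered from the lowest level to the highest.
    """
    rng = random.Random(seed)
    return {
        (r, c): [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                 for _ in range(num_levels)]
        for r in range(rows)
        for c in range(cols)
    }


# A 3x3 grid laid over the image; (1, 1) stands in for the tip of the nose.
columns = make_columns(rows=3, cols=3)
nose_tip = columns[(1, 1)]
print(len(columns), len(nose_tip), len(nose_tip[0]))
```

Each location carries its own stack of vectors, so the same patch of the image can be described as a sub-part, a part, and a whole at once.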
But then the question is: do neighboring vectors at the same level agree? When they agree, the vectors point in the same direction, toward the same conclusion: “Yes, we both belong to the same nose.” Or, higher in the analysis tree: “Yes, we both belong to the same face.”
Seeking consensus on the nature of an object – on what precisely the object is, ultimately – GLOM’s vectors iteratively, location by location and layer by layer, average with the neighboring vectors beside them, as well as with the predicted vectors from the levels above and below.
However, the net does not average “willy-nilly” with just anything nearby, Hinton says. It averages selectively, with neighboring predictions that show similarities. “This is kind of well known in America; it’s called an echo chamber,” he says. “What you do is you only accept the opinions of people who already agree with you; and then what happens is you get an echo chamber where a whole bunch of people have the exact same opinion. GLOM actually uses it constructively.” The analogous phenomenon in Hinton’s system is these “islands of agreement.”
“Imagine a group of people in a room, shouting slight variations of the same idea,” says Frosst – or imagine these people as vectors pointing in slight variations of the same direction. “After a while, they would converge on a single idea, and they would all feel it more strongly, because they had had it confirmed by the other people around them.” This is how GLOM’s vectors reinforce and amplify their collective predictions about an image.
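The echo-chamber idea can be sketched in a few lines. This is an illustrative simplification rather than GLOM itself: it works at a single level, treats every location in a tiny row as a neighbor of every other, and leaves out the predictions coming from the levels above and below. The function name `consensus_step` and the `sharpness` parameter are assumptions; the weighting uses a softmax over dot products, one common way to let similar vectors count for more.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def consensus_step(vectors, sharpness=10.0):
    """One round of similarity-weighted averaging at a single level."""
    updated = []
    for v in vectors:
        # Softmax weights: vectors that already agree with v count the most.
        scores = [sharpness * dot(v, other) for other in vectors]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        updated.append([
            sum(w * other[k] for w, other in zip(weights, vectors)) / total
            for k in range(len(v))
        ])
    return updated


# Six locations: the first three roughly agree, and so do the last three.
row = [[1.0, 0.1], [0.9, 0.0], [1.0, -0.1],
       [0.1, 1.0], [0.0, 0.9], [-0.1, 1.0]]
for _ in range(10):
    row = consensus_step(row)

same_island = cosine(row[0], row[2])        # two vectors from the first group
different_islands = cosine(row[0], row[3])  # one vector from each group
print(round(same_island, 3), round(different_islands, 3))
```

After a few rounds, the vectors within each group point in nearly the same direction while the two groups stay apart, which is the behavior the “islands of agreement” phrase describes.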
GLOM uses these islands of agreeing vectors to accomplish the trick of representing an analysis tree in a neural network. While some recent neural networks use agreement between vectors for activation, GLOM uses agreement for representation: it builds up representations of objects within the network. For example, when several vectors agree that they all represent part of the nose, their small cluster of agreement collectively represents the nose in the network’s analysis tree for the face. Another small cluster of agreeing vectors could represent the mouth in the analysis tree; and the large cluster at the top of the tree would represent the emerging conclusion that the picture as a whole is Hinton’s face. “The way the analysis tree is represented here,” Hinton explains, “is that at the object level you have a large island; the parts of the object are smaller islands; the subparts are even smaller islands, and so on.”
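As a rough illustration of how islands at different levels could stand in for the branches of an analysis tree, the sketch below groups adjacent locations whose vectors nearly agree. The vectors are hypothetical, already-settled values invented for the example, and `find_islands` with its `threshold` is an assumed helper, not something taken from Hinton’s description.

```python
import math


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))


def find_islands(vectors, threshold=0.95):
    """Group adjacent locations in a non-empty 1-D row whose vectors nearly agree."""
    islands = [[0]]
    for i in range(1, len(vectors)):
        if cosine(vectors[i - 1], vectors[i]) >= threshold:
            islands[-1].append(i)   # same island as the previous location
        else:
            islands.append([i])     # agreement breaks, so a new island starts
    return islands


# Hypothetical settled vectors for four locations at two levels.
part_level = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99]]
object_level = [[0.7, 0.7], [0.71, 0.69], [0.69, 0.71], [0.7, 0.7]]

parts = find_islands(part_level)      # two small islands, e.g. nose and mouth
objects = find_islands(object_level)  # one large island, e.g. the face
print(parts)    # [[0, 1], [2, 3]]
print(objects)  # [[0, 1, 2, 3]]
```

Read from the top down, the single large island at the object level sits above the two smaller islands at the part level, giving the tree its shape without any change to the network’s fixed architecture.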
According to Hinton’s longtime friend and collaborator Yoshua Bengio, a computer scientist at the University of Montreal, if GLOM manages to solve the technical challenge of representing an analysis tree in a neural network, that would be an achievement – it would be important for making neural networks work properly. “Geoff has produced incredibly powerful intuitions on several occasions during his career, many of which have turned out to be correct,” Bengio says. “Therefore, I pay attention to them, especially when he feels as strongly about them as he does about GLOM.”
The strength of Hinton’s conviction is rooted not only in the echo chamber analogy, but also in the mathematical and biological analogies that inspired and justified some of the design decisions in GLOM’s new engineering.
“Geoff is a very unusual thinker in that he is able to draw on complex mathematical concepts and integrate them with biological constraints to develop theories,” says Sue Becker, a former student of Hinton’s, now a computational cognitive neuroscientist at McMaster University. “Researchers who focus more narrowly on either mathematical theory or neurobiology are much less likely to solve the endlessly fascinating puzzle of how machines and humans can learn and think.”
Turning philosophy into engineering
So far, Hinton’s new idea has been well received, especially in some of the world’s largest echo chambers. “On Twitter, I got a lot of likes,” he says. And one YouTube tutorial laid claim to the term “MeGLOMania.”
Hinton is the first to admit that at present GLOM is little more than a philosophical reflection (he spent a year as an undergraduate in philosophy before moving on to experimental psychology). “If an idea sounds good in philosophy, it is good,” he says. “How would you ever have a philosophical idea that just sounds like crap, but actually turns out to be true? It would not pass for a philosophical idea.” Science, by comparison, is “full of things that look like complete garbage” but work remarkably well – for example, neural networks, he says.
GLOM is designed to sound philosophically plausible. But will it work?