AI and machine learning systems have become increasingly proficient in recent years, able not only to understand the written word, but also to write it. But while these artificial intelligences have almost mastered the English language, they have not yet mastered the language of computers – that is, until now. IBM announced Monday at its Think 2021 conference that its researchers had designed a Rosetta Stone for programming code.
Over the past decade, advances in AI have been primarily “driven by deep neural networks, and even that has been driven by three major factors: data, with the availability of large datasets for training; innovations in new algorithms; and the massive acceleration of ever-faster GPU-driven computing hardware,” said Ruchir Puri, IBM Fellow and Chief Scientist at IBM Research, during his Think 2021 presentation, comparing the new dataset to the revered ImageNet, which spawned the recent rush in computer vision.
“Software is eating the world,” Marc Andreessen wrote in 2011. “And if software is eating the world, AI is eating software,” Puri told Engadget. “It was this relationship between visual tasks and language tasks, when common algorithms could be used between them, that led to the revolution in breakthroughs in natural language processing, starting with the advent of Watson’s Jeopardy! win, in 2012,” he continued.
Indeed, we have taught computers to speak human, so why not also teach them to speak computer? This is what IBM’s CodeNet project seeks to accomplish. “We need our ImageNet, which can snowball innovation and can unleash that innovation in algorithms,” Puri said. CodeNet is essentially the ImageNet of computers: a large dataset designed to teach AI/ML systems how to translate code, consisting of some 14 million code samples and 500 million lines of code spread across more than 55 legacy and active languages – from COBOL and FORTRAN to Java, C++ and Python.
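For a sense of how researchers typically work with a corpus at this scale, the short Python sketch below tallies code samples per language from a machine-readable metadata file. The file name problem_submissions.csv and its language column are illustrative assumptions, not CodeNet’s documented layout.

# Tally code samples per language from a CodeNet-style metadata file.
# "problem_submissions.csv" and the "language" column are hypothetical.
import csv
from collections import Counter

languages = Counter()
with open("problem_submissions.csv", newline="") as f:
    for row in csv.DictReader(f):
        languages[row["language"]] += 1  # one row per code sample (assumed)

for lang, count in languages.most_common(10):
    print(f"{lang:10} {count}")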
“Since the dataset itself contains 50 different languages, it can actually activate algorithms for many pairs of combinations,” Puri explained. Similar work has already been done with human languages: neural machine translation, rather than translating between language pairs directly, becomes more language-independent and derives an intermediate abstraction through which it translates into many different languages. In short, the dataset is constructed in such a way as to allow two-way translation. In other words, you can take some old COBOL code – which, terrifyingly, still makes up a significant amount of the banking and federal government infrastructure in this country – and translate it into Java as easily as you could take a snippet of Java and translate it back into COBOL.
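To make that pivot idea concrete, here is a toy Python sketch: each language parses into a shared intermediate form and is emitted from it, so supporting a new language means writing one parser and one emitter rather than a translator for every pair. The Assign class and the single-statement grammar are deliberately simplistic illustrations, not CodeNet’s actual representation.

# Toy pivot-based translation: each language maps to and from a shared
# intermediate form instead of translating pairwise. Illustrative only.
from dataclasses import dataclass

@dataclass
class Assign:
    """Language-neutral intermediate form of an assignment statement."""
    target: str
    value: str

def parse_cobol(line: str) -> Assign:
    # "MOVE 5 TO TOTAL." -> Assign(target="TOTAL", value="5")
    words = line.rstrip(".").split()
    assert words[0] == "MOVE" and words[2] == "TO"
    return Assign(target=words[3], value=words[1])

def parse_java(line: str) -> Assign:
    # "int total = 5;" -> Assign(target="total", value="5")
    lhs, rhs = line.rstrip(";").split("=")
    return Assign(target=lhs.split()[-1], value=rhs.strip())

def emit_java(stmt: Assign) -> str:
    return f"int {stmt.target.lower()} = {stmt.value};"

def emit_cobol(stmt: Assign) -> str:
    return f"MOVE {stmt.value} TO {stmt.target.upper()}."

# Round trip: COBOL -> intermediate -> Java -> intermediate -> COBOL.
java = emit_java(parse_cobol("MOVE 5 TO TOTAL."))   # "int total = 5;"
back = emit_cobol(parse_java(java))                 # "MOVE 5 TO TOTAL."
print(java, back)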
“We believe that natural language processing and machine learning can be applied to understanding software languages, performing automated reasoning and decision making, and being able to explain those decisions, just as we can on the computer vision and natural language processing side,” he said.
But just as with human languages, computer code is created to be understood in a specific context. Unlike human languages, however, “programming languages can be compared, very succinctly, on a metric of ‘does the program compile, does the program do what it was supposed to do and, if there is a test set, does it solve and meet the test criteria,’” Puri said. Thus, CodeNet can be used for tasks such as code search and clone detection, in addition to its intended translation functions and serving as a reference dataset. Additionally, each sample is tagged with its CPU runtime and memory footprint, allowing researchers to conduct regression studies and potentially develop automated code correction systems.
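A minimal sketch of that “does it compile, does it meet the test criteria” metric for a single Python submission might look like the following. The (stdin, expected stdout) test format and the ten-second timeout are assumptions for illustration, not CodeNet’s evaluation harness.

# Check a submission against (stdin, expected stdout) pairs, treating
# crashes and timeouts as failures. The test format is hypothetical.
import subprocess

def passes_tests(source_path, tests):
    for given_input, expected in tests:
        try:
            result = subprocess.run(
                ["python", source_path],
                input=given_input,
                capture_output=True,
                text=True,
                timeout=10,  # a runaway submission counts as a failure
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0:                      # crashed: fail
            return False
        if result.stdout.strip() != expected.strip():   # wrong answer
            return False
    return True

# Example: a solution expected to print the sum of two integers.
print(passes_tests("solution.py", [("1 2\n", "3"), ("10 -4\n", "6")]))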
The CodeNet project consists of over 14 million code samples as well as over 4,000 coding problems collected and organized from decades of programming challenges and competitions around the world. “The way the dataset was created,” Puri said, “there are many types of programming competitions and all kinds of problems – some are more practical, some are more academic. These are the languages that have been used over the past decade and a half in many of these competitions, with thousands of students or competitors submitting solutions.”
Additionally, users can run individual code samples “to extract metadata and verify the accuracy of generative AI model results,” according to an IBM press release. “This will enable researchers to program intent equivalence when translating from one programming language to another.”
While this dataset could theoretically be used to generate entirely new code sequences, much as GPT-3 does with English, CodeNet’s strength lies in its ability to translate. “We’re trying to do exactly what ImageNet did for computer vision,” he said. “It fundamentally changed the game; it was highly organized, with a very focused dataset for a very broad area. We hope that CodeNet, with its diversity of tasks, diversity of data and large scale, will deliver the same value.” In addition, Puri estimates that more than 80 percent of the posed problems already come with more than 100 variant answers each, offering a wide range of possible solutions.
“We’re very excited about this,” Puri exclaimed. “We hope and believe this will be to code what ImageNet was to computer vision.” IBM intends to release the CodeNet data into the public domain, giving researchers around the world equal and free access.