A team led by IT specialists examined ten of the most cited datasets used to test machine learning systems. They found that about 3.4% of the data was inaccurate or mislabeled, which could cause problems in AI systems that rely on these datasets.
The datasets, which have each been cited over 100,000 times, include text-based ones drawn from newsgroups and Amazon product reviews. Errors include Amazon product reviews that were labeled as positive when they were in fact negative, and vice versa.
Some of the image-based errors are the result of mixing up animal species. Others come from photos labeled with a less prominent object (“water bottle” instead of the mountain bike it’s attached to, for example). One particularly irritating example that emerged was a baby mistaken for a nipple.
One of the datasets focuses on audio from YouTube videos. A clip of a YouTuber speaking to the camera for three and a half minutes was tagged as a “church bell”, even though a bell could only be heard in the last 30 seconds or so. Another error was a recording misclassified as an orchestra.
To find possible errors, the researchers used a framework called confident learning, which examines datasets for label noise (mislabeled or irrelevant data). They then validated the candidate errors and found that about 54% of the data flagged by the algorithm had incorrect labels. The dataset with the most errors had about 5 million (about 10 percent of the dataset). The team made its results available so anyone can browse the label errors.
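The idea behind this kind of label-noise check can be sketched in a few lines of Python. The snippet below is a simplified illustration only, not the researchers' code: it assumes a classifier has already produced a probability for each class on each example, and the function name `find_label_issues_sketch` and the toy data are hypothetical. It sets a per-class confidence threshold (the average probability the model gives a class on examples carrying that label) and flags any example that confidently looks like a class other than its given label. It also assumes every class has at least one labeled example.

```python
import numpy as np


def find_label_issues_sketch(labels, pred_probs):
    """Return indices of examples whose given label disagrees with a
    confident model prediction (simplified, illustrative version)."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    n_classes = pred_probs.shape[1]

    # Per-class threshold: mean predicted probability of class k
    # over the examples that are labeled k.
    thresholds = np.array(
        [pred_probs[labels == k, k].mean() for k in range(n_classes)]
    )

    # An example "confidently" looks like class k when its probability
    # for k meets that class's threshold.
    confident = pred_probs >= thresholds

    # Among the confident classes, take the most probable one (-1 if none).
    masked = np.where(confident, pred_probs, -np.inf)
    guess = np.where(confident.any(axis=1), masked.argmax(axis=1), -1)

    # Flag examples whose confident guess differs from the given label.
    return np.flatnonzero((guess != -1) & (guess != labels))


# Toy data: the third example is labeled 0 but the model strongly favors 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
print(find_label_issues_sketch(labels, pred_probs).tolist())  # [2]
```

Flagged examples are only candidates. As the 54% figure suggests, a human check is still needed to confirm which of them are truly mislabeled.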
Some of the errors are relatively minor, and others appear to be a case of hair-splitting (a close-up of a Mac command key labeled as a “computer keyboard” is arguably still acceptable). Sometimes the confident learning approach itself was wrong, such as when it mistook a properly labeled picture of tuning forks for a menorah.
If the labels are even a little off, it could have huge ramifications for machine learning systems. If an AI system can’t tell the difference between a grocery store and a bunch of crabs, it would be hard to trust it to pour you a drink.