Overfitting in Machine Learning: What Is It & How Can It Be Prevented?

Building generalization into a data model is an important way for businesses to improve the success rate of their artificial intelligence efforts.

Ernie Smith

Ernie Smith is a former contributor to BizTech, an old-school blogger who specializes in side projects, and a tech history nut who researches vintage operating systems for fun.

In data analysis, it is important to take steps to build an accurate, well-considered model that can help with processes such as automation and machine learning.

But in building that model, serious gaps can emerge as a result of too much detail — and the effect can create real-world problems with how the data is interpreted.

The result is called overfitting, a major challenge in the world of data analytics and artificial intelligence. Getting a strong understanding of the problem is the first step to building a model that generates better results.

What Is Overfitting?

In general, overfitting refers to a model that is too closely aligned to the specific data set it was trained on, so that in practice the model does not properly account for real-world variance.

In an explanation on the IBM Cloud website, the company says the problem can emerge when the data model becomes complex enough that it begins to overemphasize irrelevant information, or “noise,” in the data set.

“When the model memorizes the noise and fits too closely to the training set, the model becomes ‘overfitted,’ and it is unable to generalize well to new data,” the company writes. “If a model cannot generalize well to new data, then it will not be able to perform the classification or prediction tasks that it was intended for.”

So, because its contours favor the data it was trained on, the model is more likely to produce false positives or false negatives when used in the real world.
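
A small Python sketch can make the idea concrete. The data here is a hand-made toy set, not real measurements, and the names `memorizer`, `fit_threshold` and `simple_rule` are illustrative assumptions rather than anything from IBM's explanation. One model copies the label of the nearest training point, noise included; the other learns a single cutoff.

```python
# Toy, hand-made data: the true rule is "label 1 when x is above 5.5".
# Two training labels (x=3 and x=8) are deliberately wrong to act as noise.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0),
         (6, 1), (7, 1), (8, 0), (9, 1), (10, 1)]
new_data = [(2.8, 0), (3.3, 0), (4.6, 0), (5.2, 0),
            (6.4, 1), (7.7, 1), (8.2, 1), (9.5, 1)]


def accuracy(model, data):
    """Share of examples where the model's prediction matches the label."""
    return sum(model(x) == y for x, y in data) / len(data)


def memorizer(x):
    """Overfit model: copy the label of the closest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def fit_threshold(data):
    """Simple model: pick the single cutoff with the best training accuracy."""
    xs = sorted(x for x, _ in data)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda t: accuracy(lambda x: int(x >= t), data))


cutoff = fit_threshold(train)


def simple_rule(x):
    """Predict 1 for any x at or above the learned cutoff."""
    return int(x >= cutoff)


print(accuracy(memorizer, train), accuracy(memorizer, new_data))      # 1.0 0.5
print(accuracy(simple_rule, train), accuracy(simple_rule, new_data))  # 0.8 1.0
```

The memorizing model looks perfect on its training data and then repeats the noise on new data, while the simpler rule gives up a little training accuracy and generalizes.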

What Can Cause Overfitting?

In some ways, overfitting stems from issues with how the original data model was built, creating gaps in the machine’s understanding.

This can happen for many reasons. One important cause is that a model was built for specific outcomes rather than slightly more generalized ones. (There is also a threat of the opposite problem, underfitting, which happens when the data model is too simple or too lightly trained to capture the patterns in the data, and which also creates false positives or false negatives.)

Overfitting can introduce inefficiency into the business, adding costs. For example, overfitting can lead to issues in detecting security threats to internal platforms, allowing risks to enter a network undetected. When used in data forecasts, it can create a misunderstanding of how big the need for a product is, leading to problems with how that demand is managed within the supply chain.

In some cases, overfitting can represent a form of algorithmic bias, in which errors in the data model create negative outcomes for the end user. For example, people may be more likely to be denied a loan or credit based on a predetermined level of risk that doesn’t account for their specific circumstances.

The difficulty of building data models ethically underscores the importance of taking steps to avoid overfitting when bias or discrimination is a concern.

Cassie Kozyrkov, the chief decision scientist at Google, said in a 2019 presentation that a key element in battling algorithmic bias caused by problems such as overfitting is to test heavily against the available data.

“Computers have really good memory,” Kozyrkov said, according to VentureBeat. “So the way you actually test them is that you give them real stuff that’s new, that they couldn’t have memorized, that’s relevant to your problem. And if it works then, then it works.”

How Can IT Teams Test and Detect Overfitting?

Strong testing is the key factor in avoiding overfitting. A key tell of an overfit model is that when it is put into a real-world setting, it strongly underperforms compared with how it performed on the data it was trained on.

As data science blogger Juan Orozco Villalobos of the website BrainsToBytes noted, the gap between how the model performs on the data it was trained on and on data it has not seen tells the full story.

“The easiest way to find out if your model is overfitting is by measuring its performance on your training and validation sets,” Villalobos says. “If your model performs much better with training data than with validation data, you are overfitting.” He adds that introducing more test data can help strengthen the model against such quirks over time.
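
That comparison is simple to automate. The sketch below is an illustration under stated assumptions, not Villalobos' code: the function name `looks_overfit`, the 0.10 tolerance and the accuracy figures are all hypothetical, and what counts as "much better" will depend on the problem.

```python
def looks_overfit(train_score, validation_score, tolerance=0.10):
    """Flag a model whose training score beats its validation score
    by more than `tolerance` (an arbitrary, illustrative cutoff)."""
    return (train_score - validation_score) > tolerance


# Hypothetical accuracy figures for two models.
print(looks_overfit(0.98, 0.71))  # True: large gap between training and validation
print(looks_overfit(0.90, 0.88))  # False: the two scores are close
```

A check like this only signals a likely problem; a small gap does not prove that a model will hold up on truly new data.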

However, it’s worth keeping in mind that there may be some cases in which overfitting is preferred.

The security company CrowdStrike, for example, has found that in the methods it uses to prevent malicious data, overfitting may be preferable to a more generalized approach.

“Across many problem domains, models that heavily overfit the training data perform better than the best models that do not,” writes Robert Molony, a senior data scientist at CrowdStrike, in a blog post. “This observation has been replicated across many problem domains and model architectures.”

How Can IT Teams Prevent Overfitting?

Avoiding overfitting comes down to building a strong data model and testing it heavily, using tools such as CDW Amplified™ Data Services to help analyze the capabilities of your model.

In a blog post for the website Towards Data Science, David Chuan-En Lin, a PhD student at Carnegie Mellon University’s Human-Computer Interaction Institute, explains that a number of strategies can help prevent overfitting in data models. Among them:

For large data sets, set aside a portion of the data to test the trained model. (Lin recommends that about one-fifth of the data be set aside for testing purposes.) This enables a re-creation of real-world conditions by allowing the model to be tested against data it did not see during training.
For smaller data sets, apply data augmentation to artificially increase the size of the data set. Lin notes that this approach is effective in cases of image classification, in which images can be rotated or warped to create additional training examples.
For data sets with large numbers of features, reduce the number of features analyzed so that the data model is not built with an excessive degree of specificity.
Apply regularization techniques to the models, such as L1 or L2 regularization or eliminating layers within the model, to remove complexity from the model.
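
Three of the strategies above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not Lin's code: the function names, the fixed shuffle seed, the tiny 2-by-2 "image" and the one-feature model are all assumptions made for brevity, and feature reduction is left out.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold back a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def rotate_clockwise(image):
    """Augmentation: return a 90-degree clockwise rotation of a 2D grid."""
    return [list(row) for row in zip(*image[::-1])]


def ridge_weight(xs, ys, l2_penalty):
    """One-feature linear fit with no intercept and an L2 penalty.

    Minimizes sum((y - w * x) ** 2) + l2_penalty * w ** 2, which has the
    closed-form solution w = sum(x * y) / (sum(x * x) + l2_penalty).
    """
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + l2_penalty
    return numerator / denominator


train_rows, test_rows = holdout_split(range(100))
print(len(train_rows), len(test_rows))               # 80 20
print(rotate_clockwise([[1, 2], [3, 4]]))            # [[3, 1], [4, 2]]
print(ridge_weight([1, 2, 3, 4], [2, 4, 6, 8], 0))   # 2.0
print(ridge_weight([1, 2, 3, 4], [2, 4, 6, 8], 10))  # 1.5
```

The last two lines show the effect of regularization: with no penalty the fitted weight is 2.0, and with a penalty of 10 it shrinks to 1.5, trading some fit on the training data for a simpler model.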

Beyond these more traditional approaches, up-and-coming technologies could also provide a potential solution to the overfitting problem, depending on the use case. For example, the GPU manufacturer NVIDIA has been building methods for using synthetic data in training deep neural networks.

Last fall, the company announced its Omniverse Replicator, a data generation engine that can help create synthetic data in use cases such as autonomous driving or robotics. In a recent interview with IEEE Spectrum, the company’s vice president of simulation technology and Omniverse engineering, Rev Lebaredian, said using synthetic data can make it easier to account for issues of algorithmic bias “because it’s much easier for us to provide a diverse data set.”

“If I’m generating images of humans and I have a synthetic data generator, that allows me to change the configurations of people’s faces, their skin tone, eye color, hairstyle, and all of those things,” he says.
