5 Mistakes to Avoid in Data Prep for Predictive Modeling

By Rapid Insight

Here at Rapid Insight, we assist professionals of all backgrounds (from business analysts to institutional researchers) in building predictive models for their organizations. The models themselves often differ widely depending on who we’re working with, but we’ve observed a few data prep stumbling blocks that pop up across many predictive modeling projects. Here are the five data prep mistakes we see most often (and advice on avoiding them)!

Mistake 1:

Including ID Fields as Predictors

Most ID fields (such as a customer or user ID number) look like continuous numeric variables (e.g., Customer 143, Customer 532). If you use automated predictive modeling software, IDs may make their way into your model as predictors. Exclude ID fields, or disregard any correlations they show, as early in the process as possible to avoid skewing your model’s results.
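
As a rough illustration, here’s a minimal Python sketch of screening for ID-like fields before modeling. It uses only the standard library, and the function names, the `_id` naming hint, and the sample records are all our own assumptions for the example, not a feature of any particular tool. It flags a column if its name ends in `_id` or if every value is a distinct integer.

```python
def find_id_like_columns(rows, name_hints=("_id",)):
    """Return column names that look like identifiers.

    A column is flagged when its name ends with one of the hints, or when
    every value is a distinct integer (one unique value per record).
    """
    if not rows:
        return []
    flagged = []
    for col in rows[0]:
        values = [row[col] for row in rows]
        name_match = col.lower().endswith(name_hints)
        all_unique_ints = (
            all(isinstance(v, int) and not isinstance(v, bool) for v in values)
            and len(set(values)) == len(values)
        )
        if name_match or all_unique_ints:
            flagged.append(col)
    return flagged


def drop_columns(rows, columns):
    """Return copies of the rows without the given columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


# Hypothetical sample records, for illustration only.
customers = [
    {"customer_id": 143, "tenure_months": 12, "renewed": 1},
    {"customer_id": 532, "tenure_months": 3, "renewed": 0},
    {"customer_id": 87, "tenure_months": 12, "renewed": 1},
]

id_columns = find_id_like_columns(customers)
modeling_rows = drop_columns(customers, id_columns)
```

A check like this is only a screen. A legitimate predictor can happen to be unique per record in a small file, so it’s worth reviewing the flagged columns by hand before dropping them.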

Mistake 2:

Including Duplicate Records

Don’t include duplicates in a modeling file. A duplicated record is counted twice, which gives that data point double the weight of the others and can harm your model’s accuracy. It never hurts to dedupe your modeling file before you start building a predictive model.
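
Here’s one way a dedupe step might look in plain Python. This is a sketch under our own assumptions: the `drop_duplicates` helper and the sample records are hypothetical, and it keeps the first occurrence of each record. By default two records count as duplicates only when every field matches; passing `key_fields` compares just those fields instead.

```python
def drop_duplicates(rows, key_fields=None):
    """Keep the first occurrence of each record, preserving order.

    With key_fields=None, two records are duplicates only if every field
    matches; otherwise only the listed fields are compared.
    """
    seen = set()
    unique_rows = []
    for row in rows:
        fields = key_fields if key_fields is not None else sorted(row)
        key = tuple((f, row[f]) for f in fields)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Hypothetical sample records, for illustration only.
records = [
    {"customer_id": 143, "plan": "basic", "renewed": 1},
    {"customer_id": 532, "plan": "plus", "renewed": 0},
    {"customer_id": 143, "plan": "basic", "renewed": 1},  # exact duplicate
]

deduped = drop_duplicates(records)
removed = len(records) - len(deduped)
print(f"Removed {removed} duplicate record(s)")
```

Reporting how many records were removed is a cheap sanity check: an unexpectedly large number often points to a bad join upstream rather than true duplicates.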

Mistake 3:

Using Anachronistic Variables

Sometimes, a historical dataset contains fields that continue to update after the target outcome would have occurred. In these cases, the attribute can become a surrogate for having achieved the outcome. For example, if you are conducting a monthly customer retention analysis, customers who canceled service will stop calling customer service, while renewing customers may keep accumulating calls. Because it’s not guaranteed that every call took place before the customer decided to renew, “Calls to customer service” might be considered an anachronistic variable.
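
One common way to guard against this, sketched below, is to rebuild the variable using only events dated before each customer’s decision point. The sketch assumes you have call-level dates and a per-customer decision date available; the `count_events_before` helper and the sample data are hypothetical.

```python
from datetime import date


def count_events_before(events, cutoffs):
    """Count each customer's events dated strictly before that customer's cutoff."""
    counts = {customer: 0 for customer in cutoffs}
    for customer, event_date in events:
        cutoff = cutoffs.get(customer)
        if cutoff is not None and event_date < cutoff:
            counts[customer] += 1
    return counts


# Hypothetical data: (customer, call date) pairs and each customer's
# renewal decision date.
calls = [
    ("A", date(2023, 1, 10)),
    ("A", date(2023, 2, 20)),  # after A's decision date, so excluded
    ("A", date(2023, 3, 5)),   # after A's decision date, so excluded
    ("B", date(2023, 1, 3)),
]
decision_dates = {"A": date(2023, 2, 1), "B": date(2023, 2, 1)}

calls_before_decision = count_events_before(calls, decision_dates)
```

In this example, customer A’s raw call count is three, but only one call happened before the decision, so only that one is safe to use as a predictor.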

Mistake 4:

Modeling on Too Small of a Population

Double-check your population size. A good baseline for a modeling dataset is 1,000 records spanning three cycles. Including at least three cycles helps to account for any cycle-to-cycle fluctuations. The larger your population size is, the more robust your model will be.
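
A quick check against that baseline might look like the sketch below. The thresholds come from the rule of thumb above; the `check_population` helper, the `term` field, and the generated sample rows are our own assumptions for illustration.

```python
from collections import Counter


def check_population(rows, cycle_field, min_records=1000, min_cycles=3):
    """Compare a modeling file against a records-and-cycles baseline."""
    per_cycle = Counter(row[cycle_field] for row in rows)
    return {
        "records": len(rows),
        "cycles": len(per_cycle),
        "records_per_cycle": dict(per_cycle),
        "enough_records": len(rows) >= min_records,
        "enough_cycles": len(per_cycle) >= min_cycles,
    }


# Hypothetical modeling file: 400 records in each of two cycles.
rows = [
    {"term": term, "enrolled": i % 2}
    for term in ("2021", "2022")
    for i in range(400)
]

report = check_population(rows, "term")
```

Here the file falls short on both counts (800 records across two cycles), which suggests pulling in another cycle of history before modeling.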

Mistake 5:

Failing to Account for Outliers and Missing Values

Be sure to identify and account for outliers and missing values in your dataset. Nulls and values that fall far outside the normal range add up when you’re combining data to build a predictive model. Check the minimum and maximum values for each variable to spot any records that fall outside the norm, and fill in null values (when appropriate) to ensure your model isn’t dragged in a certain direction artificially.
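
The min/max check and null fill can be sketched in a few lines of Python. Filling with the median is our own assumption here, chosen because it is less sensitive to outliers than the mean; the right fill (or whether to fill at all) depends on the variable. The helper names and the sample data are hypothetical.

```python
from statistics import median


def summarize_column(rows, field):
    """Report missing count, min, max, and median for one numeric field."""
    present = [row[field] for row in rows if row[field] is not None]
    return {
        "missing": len(rows) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "median": median(present) if present else None,
    }


def fill_missing(rows, field, fill_value):
    """Return copies of the rows with None in the field replaced by fill_value."""
    return [
        {**row, field: fill_value} if row[field] is None else dict(row)
        for row in rows
    ]


# Hypothetical sample: one null and one value (410) that is likely a typo.
people = [
    {"age": 34},
    {"age": 41},
    {"age": None},
    {"age": 29},
    {"age": 410},
]

summary = summarize_column(people, "age")
filled = fill_missing(people, "age", summary["median"])
```

The summary shows a maximum of 410, which is the kind of value worth investigating and correcting before you compute any fill value from the column.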

Better Data Prep = Better Predictive Modeling Results

Thoroughly reviewing and prepping your data before building a predictive model (and automating as much of that work as possible!) helps ensure the integrity of your results, and the right data tools make that much easier. Rapid Insight’s data analysis tools will help you:

  • Make fewer mistakes by automatically detecting and correcting duplicates and missing values.
  • Work faster with an intuitive interface and repeatable data prep workflows.
  • Get better results by working with our expert, in-house analyst support team. 

Interested in learning more? Click the button below to schedule a live demo!

