Tuesday, December 11, 2012

Five Data Preparation Mistakes (and How to Avoid Them!)

After building many predictive models in the Rapid Insight office and helping our customers build many more outside of it, we have a list of data preparation mistakes that could fill a room. Here are some of the most common ones we've seen:


1. Including ID Fields as Predictors
Because most IDs look like continuous integers (and older IDs are typically smaller), they can easily make their way into the model as predictive variables. Be sure to exclude them as early in the process as possible to avoid any confusion while building your model.
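
As a rough sketch of what this looks like in practice (assuming a pandas DataFrame and a hypothetical constituent_id field; the column names here are illustrative, not from any particular dataset):

```python
import pandas as pd

# Hypothetical model file with an ID column that should never be a predictor
df = pd.DataFrame({
    "constituent_id": [1001, 1002, 1003],   # assumed ID field name
    "age": [54, 37, 62],
    "made_gift_last_year": [1, 0, 1],
})

# Drop ID-style fields up front so they can't sneak into the model as predictors
id_like_columns = [c for c in df.columns if c.lower() == "id" or c.lower().endswith("_id")]
predictors = df.drop(columns=id_like_columns)
```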

2. Using Anachronistic Variables
Make sure that no predictor variables contain information about the outcome. Because models are built using historical data, some of the variables accessible to you now may not have been available at the point in time the model is meant to reflect. No predictor variable should be a proxy for your dependent variable (i.e., "made a gift" = donor, "deposited" = enrolled).
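
One quick, if imperfect, screen is to flag candidate predictors that are almost perfectly correlated with the outcome. This is only a sketch with made-up column names ("deposited", "enrolled", "campus_visits") and an arbitrary 0.95 cutoff, but it catches obvious proxies:

```python
import pandas as pd

# Hypothetical model file: "enrolled" is the outcome; "deposited" secretly encodes it
df = pd.DataFrame({
    "deposited": [1, 0, 1, 1, 0],
    "campus_visits": [2, 0, 3, 1, 1],
    "enrolled": [1, 0, 1, 1, 0],
})

outcome = "enrolled"

# Flag any candidate predictor that is (nearly) perfectly correlated with the outcome --
# a strong hint that it is really a proxy for the thing being predicted
suspects = [
    col for col in df.columns.drop(outcome)
    if abs(df[col].corr(df[outcome])) > 0.95
]
print(suspects)  # ['deposited']
```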

3. Allowing Duplicate Records
Don’t include duplicates in a model file. Including just two records per person gives that person twice as much predictive power. To make sure that each person’s influence counts equally, only one record per person or action being modeled should be included. It never hurts to dedupe your model file before you start building a predictive model. 
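
A minimal dedupe step, assuming a pandas DataFrame keyed by a hypothetical person_id column, might look like this:

```python
import pandas as pd

# Hypothetical model file with one person listed twice
df = pd.DataFrame({
    "person_id": [1, 1, 2, 3],
    "gift_total": [50, 50, 200, 75],
})

# Keep one record per person so nobody counts twice in the model
deduped = df.drop_duplicates(subset="person_id", keep="first")
```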

4. Modeling on Too Small of a Population
Double-check your population size. A good goal to shoot for in a modeling dataset is at least 1,000 records spanning three years. Including at least three years helps to account for any year-to-year fluctuations in your dataset. The larger your population size is, the more robust your model will be.
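
A simple sanity check against those rules of thumb (assuming a record_year column and the 1,000-record / three-year targets mentioned above; both thresholds are just the guidelines from this post, not hard requirements):

```python
import pandas as pd

# Hypothetical three-year model file
df = pd.DataFrame({
    "record_year": [2010, 2011, 2012] * 400,
    "outcome": [0, 1, 0] * 400,
})

n_records = len(df)
n_years = df["record_year"].nunique()

if n_records < 1000 or n_years < 3:
    print(f"Warning: only {n_records} records across {n_years} year(s) -- "
          "the model may not be robust.")
```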

5. Not Accounting for Outliers and/or Missing Values
Be sure to account for any outliers and/or missing values. Large gaps in individual variables can add up when you're combining those variables to build a predictive model. Checking the minimum and maximum values for each variable is a quick way to spot any records that fall outside the usual range.
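
A quick scan of minimums, maximums, and missing counts can be done in a few lines; this sketch uses made-up columns, with an obviously impossible age and one missing value planted to show what the check surfaces:

```python
import pandas as pd
import numpy as np

# Hypothetical model file: 620 is an obvious outlier; one age is missing
df = pd.DataFrame({
    "age": [54, 37, 620, np.nan],
    "gift_total": [50, 200, 75, 125],
})

# Min and max for every numeric variable, plus a count of missing values per column
summary = df.describe().loc[["min", "max"]]
missing = df.isna().sum()
print(summary)
print(missing)
```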


-Caitlin Garrett, Statistical Analyst at Rapid Insight
