1. Including ID Fields as Predictors
Because most IDs look like continuous integers (and older
IDs are typically smaller), they can easily make their way into
the model as predictive variables. Be sure to exclude them as early in the
process as possible to avoid any confusion while building your model.
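
As a minimal sketch of stripping ID fields before modeling, assuming the records are plain Python dictionaries: the field names (`donor_id`, `age`, `gift_count`) and the helper `drop_id_fields` are hypothetical and only illustrate the idea.

```python
def drop_id_fields(records, id_fields):
    """Return copies of the records without the named ID fields."""
    excluded = set(id_fields)
    return [
        {name: value for name, value in record.items() if name not in excluded}
        for record in records
    ]

# Hypothetical sample data: donor_id is an identifier, not a predictor.
records = [
    {"donor_id": 1001, "age": 34, "gift_count": 2},
    {"donor_id": 1002, "age": 51, "gift_count": 0},
]
predictors = drop_id_fields(records, ["donor_id"])
```

Working from a copy leaves the original records, IDs included, available for joining scores back to people later.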
2. Using Anachronistic Variables
Make sure that no predictor variables contain information
about the outcome. Because models are built using historical data, it is
possible that some of the variables you can access when building your
model were not yet available at the point in time the model is meant to reflect. No
predictor variable should be a proxy for your dependent variable (e.g., “made a
gift” = donor, “deposited” = enrolled).
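
One rough way to screen for proxies is to flag any field whose values line up perfectly with the outcome. The sketch below assumes dictionary records and hypothetical field names (`made_gift`, `state`, `donor`). It is only a heuristic: a field with all-unique values would also be flagged, so it is best suited to flags and categories.

```python
def perfect_proxies(records, outcome, candidates):
    """Flag candidate fields where each value maps to exactly one outcome."""
    flagged = []
    for field in candidates:
        outcome_for_value = {}
        is_proxy = True
        for record in records:
            expected = outcome_for_value.setdefault(record[field], record[outcome])
            if expected != record[outcome]:
                is_proxy = False  # same value seen with two different outcomes
                break
        if is_proxy:
            flagged.append(field)
    return flagged

# Hypothetical sample data: made_gift mirrors the donor outcome exactly.
records = [
    {"made_gift": 1, "state": "NH", "donor": 1},
    {"made_gift": 0, "state": "NH", "donor": 0},
    {"made_gift": 1, "state": "VT", "donor": 1},
    {"made_gift": 0, "state": "VT", "donor": 0},
]
suspects = perfect_proxies(records, "donor", ["made_gift", "state"])
```

A flagged field is not automatically anachronistic, but it is worth asking when its value would actually have been known.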
3. Allowing Duplicate Records
Don’t include duplicates in a model file. Including just two
records per person gives that person twice as much influence on the model. To make
sure that each person’s influence counts equally, include only one record per person or
action being modeled. It never hurts to dedupe your model file before you start building a predictive model.
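
A minimal deduplication sketch, assuming dictionary records and a hypothetical `person_id` key. It keeps the first record seen for each person, which is an arbitrary choice here; your data may call for keeping the most recent or most complete record instead.

```python
def dedupe(records, key):
    """Keep the first record for each distinct value of `key`."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique

# Hypothetical sample data: person 7 appears twice.
records = [
    {"person_id": 7, "gift_count": 2},
    {"person_id": 8, "gift_count": 0},
    {"person_id": 7, "gift_count": 2},
]
model_file = dedupe(records, "person_id")
```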
4. Modeling on Too Small a Population
Double-check your population size. A good goal to shoot for
in a modeling dataset is at least 1,000 records spanning at least three years. Including
at least three years helps to account for any year-to-year fluctuations in your
dataset. The larger your population size is, the more robust your model will
be.
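
The size check can be sketched as below, assuming each record carries a hypothetical `year` field. The defaults of 1,000 records and three years come from the goal stated above and are guidelines rather than hard limits.

```python
from collections import Counter

def check_population(records, year_field, min_records=1000, min_years=3):
    """Summarize record and year counts against the suggested minimums."""
    per_year = Counter(record[year_field] for record in records)
    return {
        "records": len(records),
        "years": len(per_year),
        "per_year": dict(per_year),
        "enough_records": len(records) >= min_records,
        "enough_years": len(per_year) >= min_years,
    }

# Hypothetical sample data: 1,200 records spread evenly over three years.
records = [{"year": 2011 + i % 3} for i in range(1200)]
summary = check_population(records, "year")
```

The per-year counts are worth a glance too, since one very thin year contributes little toward smoothing out fluctuations.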
5. Not Accounting for Outliers and/or Missing Values
Be sure to account for any outliers and/or missing values.
Extreme values and gaps in individual variables can add up when you’re combining those
variables to build a predictive model. Checking the minimum and maximum values for each variable can be a quick way to spot any records that fall outside the expected range.
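
The minimum and maximum check can be sketched as follows, assuming dictionary records in which a missing value is stored as `None`. The field name `age` and the helper `summarize` are hypothetical.

```python
def summarize(records, fields):
    """Report the missing count, minimum, and maximum for each field."""
    summary = {}
    for field in fields:
        values = [record.get(field) for record in records]
        present = [value for value in values if value is not None]
        summary[field] = {
            "missing": len(values) - len(present),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
    return summary

# Hypothetical sample data: one missing age and one implausible age.
records = [{"age": 34}, {"age": 51}, {"age": None}, {"age": 430}]
report = summarize(records, ["age"])
```

Here the maximum of 430 stands out immediately as a likely data entry error.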
[photo credit]
-Caitlin Garrett, Statistical Analyst at Rapid Insight