Failing to consider enough variables
When deciding which variables to audition for a model, you want to include anything you have on hand that could possibly be predictive. Weeding out the extra variables is something your modeling program will do, so don't be afraid to throw the kitchen sink at it on your first pass.
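As a sketch of what that weeding-out step looks like, here's a small example using scikit-learn and made-up data as stand-ins for your modeling program and institutional dataset: recursive feature elimination auditions ten candidate variables and keeps only the most predictive few.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))  # 10 candidate predictors
# Only the first two actually drive the outcome
y = (X[:, 0] + X[:, 1] + rng.normal(size=500) > 0).astype(int)

# Throw everything in and let the selection routine weed out the extras
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)
print(selector.support_)  # boolean mask over the 10 candidates
```

The point isn't this particular routine; it's that the software can rank and discard, so you don't have to pre-judge which variables deserve a tryout.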
Not hand-crafting some additional variables
Any guide-list of variables should be used as just that – a
guide – enriched by other variables that may be unique to your institution. If there are few unique variables to be had,
consider creating some to augment your dataset. Try adding new fields like
“distance from institution” or creating riffs and derivations of variables you
already have.
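Both ideas can be sketched in a few lines of pandas. The column names, campus coordinates, and records below are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical student records; these column names are assumptions
df = pd.DataFrame({
    "home_lat": [42.36, 40.71, 34.05],
    "home_lon": [-71.06, -74.01, -118.24],
    "credits_attempted": [15, 12, 18],
    "credits_earned": [15, 9, 18],
})

# "Distance from institution" via the haversine formula
# (campus coordinates here are invented for the example)
campus_lat, campus_lon = np.radians(42.37), np.radians(-71.11)
lat, lon = np.radians(df["home_lat"]), np.radians(df["home_lon"])
a = (np.sin((lat - campus_lat) / 2) ** 2
     + np.cos(lat) * np.cos(campus_lat) * np.sin((lon - campus_lon) / 2) ** 2)
df["miles_from_campus"] = 3959 * 2 * np.arcsin(np.sqrt(a))  # 3959 = Earth radius in miles

# A riff on variables you already have: a ratio instead of two raw counts
df["pct_credits_earned"] = df["credits_earned"] / df["credits_attempted"]
```

Ratios, flags, and distances like these often carry signal that the raw fields don't expose on their own.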
Selecting the wrong Y-variable
When building your dataset for a logistic regression model, you'll want to select the rarer outcome as your Y-variable. A great example from the higher ed world comes from building a retention model. In most cases, you'll actually want to model attrition, identifying the students who are likely to leave (hopefully the smaller group!) rather than those who are likely to stay.
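In practice, flipping the target is a one-liner. A minimal sketch with simulated retention data and scikit-learn standing in for your modeling program:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
hs_gpa = rng.uniform(2.0, 4.0, size=1000)
# Simulated outcomes: most students stay, so attrition is the rarer response
retained = (rng.random(1000) < 0.5 + 0.1 * hs_gpa).astype(int)

# Model attrition (the smaller group) rather than retention
attrited = 1 - retained
model = LogisticRegression().fit(hs_gpa.reshape(-1, 1), attrited)
p_leave = model.predict_proba(hs_gpa.reshape(-1, 1))[:, 1]  # P(student leaves)
```

The scores now rank students by risk of leaving, which is usually what an intervention program actually needs.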
Not enough Y-variable responses
Along with making sure that your model population is large
enough (1,000 records minimum) and spans enough time (3 years is good), you’ll
want to make sure that there are enough Y-variable responses to model. Generally,
you’ll want to shoot for at least 100 instances of the response you’d like to
model.
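Those rules of thumb are easy to check before you invest time in a model. A small helper, with the thresholds taken from the guidelines above and a 0/1 target column assumed:

```python
import pandas as pd

def check_model_readiness(df, y_col, min_rows=1000, min_responses=100):
    """Rules of thumb: 1,000+ records and 100+ instances of the target response."""
    n_rows = len(df)
    n_resp = int(df[y_col].sum())  # assumes a 0/1 target column
    problems = []
    if n_rows < min_rows:
        problems.append(f"only {n_rows} records (want {min_rows}+)")
    if n_resp < min_responses:
        problems.append(f"only {n_resp} responses (want {min_responses}+)")
    return problems  # an empty list means you're clear to model

df = pd.DataFrame({"attrited": [1] * 60 + [0] * 540})
print(check_model_readiness(df, "attrited"))
```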
Building a model on the wrong population
To borrow an example from the world of fundraising, a model
built to predict future giving will look a lot different for someone with a
giving history than someone who has never given before. Consider which
population you’d eventually like to use the model to score and build the model
tailored to that population, or consider building two models, one for each
sub-group.
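The two-model approach amounts to splitting on the population flag and fitting each group separately. A sketch with invented donor data and scikit-learn as the stand-in modeling tool:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "age": rng.integers(25, 80, 2000),
    "event_attendance": rng.integers(0, 5, 2000),
    "has_given_before": rng.integers(0, 2, 2000),
})
# Prior donors are simulated as far more likely to give again
df["will_give"] = (rng.random(2000) < 0.05 + 0.25 * df["has_given_before"]).astype(int)

# One model per sub-group, since the drivers of giving differ
# for prior donors versus never-givers
features = ["age", "event_attendance"]
models = {flag: LogisticRegression().fit(grp[features], grp["will_give"])
          for flag, grp in df.groupby("has_given_before")}
```

When scoring, each record gets routed to the model built on its own population, so neither group's patterns get averaged away.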
Judging the quality of a model using one measure
It’s difficult to capture the quality of a model in a single
number, which is why modeling outputs provide so many model fit measures.
Beyond the numbers, graphic outputs like decile analysis and lift analysis can provide
visual insight into how well the model is fitting your data and what the gains
from using a model are likely to be.
If you’re not sure which model measures to focus on, ask
around. If you know someone building models similar to yours, see which ones
they rely on and what ranges they shoot for. The take-home point is that with
all of the information available on a model output, you’ll want to consider
multiple gauges before deciding whether your model is worth moving forward
with.
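To make "multiple gauges" concrete, here's a sketch with simulated scores that computes two common fit measures plus a simple decile/lift table; scikit-learn and pandas stand in for whatever output your modeling tool provides.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, log_loss

rng = np.random.default_rng(3)
y = rng.integers(0, 2, 1000)              # actual outcomes
score = 0.3 * y + 0.7 * rng.random(1000)  # an imperfect model's scores

# Several gauges, not just one
print("AUC:     ", roc_auc_score(y, score))
print("log loss:", log_loss(y, score))

# Decile analysis: response rate in each score decile, expressed as lift
decile = pd.qcut(score, 10, labels=False)
lift = (pd.DataFrame({"decile": decile, "y": y})
        .groupby("decile")["y"].mean() / y.mean())
print(lift)  # top deciles should sit well above 1.0
```

No single number here tells the whole story: AUC speaks to ranking, log loss to calibration, and the lift table to the practical gains from working the top deciles first.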
-Caitlin Garrett, Statistical Analyst at Rapid Insight
Photo Credit: http://www.flickr.com/photos/mattimattila/
Have you made any of the above mistakes? Tell us about it (and how you found it!) in the comments.