One of the things I love the most about using statistical
modeling software (especially Analytics) is that so much of the process is
automated. Although automation has made the lives of statisticians much easier
(calculating individual standard errors by hand would take hours for each
variable), it is still important to be familiar with the methods and thinking
that go into the variable selection process. One tab that does a lot of
statistical heavy lifting for us is the Automated Mining tab, and I thought it
would be good to explore some of the tests that are being used in that tab.
The function of the Automated Mining tab is to determine,
variable by variable, which variables are statistically related to the selected
y-variable, and which are not. The statistical test will vary from pair to pair
depending on the types of variables being compared. One thing that is important
to note is that we’re not doing any modeling or looking at the relationships
between x-variables yet. The Automated Mining tab and its tests only decide
which variables could end up in the predictive model, not which ones will.
Depending on the types of x- and y-variables involved, one
of three tests is used to measure how strongly each variable pair is related:
a Chi-Square test, a Z-test, or an F-test.
| Y-Variable \ Variable Under Evaluation | Binary | Continuous          | Categorical |
|----------------------------------------|--------|---------------------|-------------|
| Binary                                 | Z-Test | Decile-Chi-Square   | Z-Test      |
| Continuous                             | Z-Test | Decile-F-Test/ANOVA | Z-Test      |
| Categorical                            | n/a    | n/a                 | n/a         |
Chi-Square Test
A chi-square test is performed for any continuous
x-variable used to predict a binary y-variable. In our Automated Mining tab,
the x-variable is split into 10 deciles and the test checks whether or not the
‘ones’ are randomly distributed across those deciles. This test is more robust
than using a linear correlation, as it captures non-linear relationships as
well as relationships that are not well fit by a curve or line.
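For readers who want to see the arithmetic, here is a minimal Python sketch of a decile chi-square, assuming equal-count bins formed by sorting on x and a standard Pearson chi-square on the resulting 10-by-2 table of ones and zeros. The function name `decile_chi_square` and the binning rule are my assumptions; the software's exact binning may differ.

```python
def decile_chi_square(x, y, n_bins=10):
    """Pearson chi-square statistic for binary y across equal-count bins of x.

    Returns (statistic, degrees_of_freedom). Illustrative sketch only:
    ties in x may be split across neighboring bins.
    """
    pairs = sorted(zip(x, y), key=lambda p: p[0])
    n = len(pairs)
    total_ones = sum(v for _, v in pairs)
    if n < n_bins or total_ones in (0, n):
        raise ValueError("need at least n_bins rows and both ones and zeros in y")

    rate = total_ones / n  # overall share of ones
    stat = 0.0
    for i in range(n_bins):
        chunk = pairs[i * n // n_bins:(i + 1) * n // n_bins]
        size = len(chunk)
        ones = sum(v for _, v in chunk)
        expected_ones = size * rate
        expected_zeros = size * (1 - rate)
        stat += (ones - expected_ones) ** 2 / expected_ones
        stat += ((size - ones) - expected_zeros) ** 2 / expected_zeros
    return stat, n_bins - 1


# The ones sit entirely in the top half of x, so the statistic is large.
x = list(range(100))
y = [1 if i >= 50 else 0 for i in x]
stat, df = decile_chi_square(x, y)
```

The statistic would then be compared against a chi-square distribution with the returned degrees of freedom (for example with `scipy.stats.chi2.sf(stat, df)`, if SciPy is available) to get a p-value.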
Z-Test
A Z-test is used for any binary or categorical predictor,
regardless of the type of y-variable it’s trying to predict. It tests
whether any category differs significantly from all other categories
combined in terms of the y-variable.
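As a rough illustration of the "one category versus everything else" idea, here is a Python sketch of a two-sample z-test on the mean of y. The function name `category_z_test` and the choice of an unpooled standard error are assumptions on my part; the software may compute its Z statistic differently (for a 0/1 y-variable, the mean here is just the proportion of ones).

```python
import math


def category_z_test(categories, y, level):
    """Compare mean y for rows in `level` against all other rows.

    Returns (z, two_sided_p). Illustrative sketch, not the product's formula.
    """
    inside = [v for c, v in zip(categories, y) if c == level]
    outside = [v for c, v in zip(categories, y) if c != level]
    if len(inside) < 2 or len(outside) < 2:
        raise ValueError("need at least two rows inside and outside the category")

    def mean_and_variance(values):
        m = sum(values) / len(values)
        var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
        return m, var

    m_in, v_in = mean_and_variance(inside)
    m_out, v_out = mean_and_variance(outside)
    se = math.sqrt(v_in / len(inside) + v_out / len(outside))
    if se == 0:
        raise ValueError("no variation in y, so z is undefined")

    z = (m_in - m_out) / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


z, p = category_z_test(["a", "a", "b", "b"], [10, 12, 0, 2], "a")
```

Running this once per category gives one z-score per category, which matches the "any category" wording above.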
F-Test
An F-test is used whenever a continuous x-variable is used
to predict a continuous y-variable. In our Automated Mining tab, the
data is sorted into deciles of the x-variable and an ANOVA test is run on these
deciles to determine whether the mean of the y-variable differs significantly
between them. This is more robust than a linear correlation, as it captures
non-linear relationships and those that do not fit a standard curve.
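Here is a minimal Python sketch of that decile ANOVA, assuming equal-count bins formed by sorting on x and a standard one-way ANOVA F statistic on y across the bins. The function name `decile_f_test` and the binning rule are assumptions for illustration, not the software's actual code.

```python
def decile_f_test(x, y, n_bins=10):
    """One-way ANOVA F statistic for y across equal-count bins of x.

    Returns (f, df_between, df_within). Illustrative sketch only.
    """
    pairs = sorted(zip(x, y), key=lambda p: p[0])
    n = len(pairs)
    if n <= n_bins:
        raise ValueError("need more rows than bins")

    groups = [
        [v for _, v in pairs[i * n // n_bins:(i + 1) * n // n_bins]]
        for i in range(n_bins)
    ]
    grand_mean = sum(v for _, v in pairs) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the bin means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual values around their own bin mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    if ss_within == 0:
        raise ValueError("no within-bin variation, so F is undefined")

    df_between = n_bins - 1
    df_within = n - n_bins
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# A U-shaped relationship: little linear correlation, but very different bin means.
x = list(range(100))
y = [(i - 49.5) ** 2 for i in x]
f, df_b, df_w = decile_f_test(x, y)
```

The U-shaped example is the kind of relationship a linear correlation would miss, while the bin means still differ sharply. A p-value would come from an F distribution with the two returned degrees of freedom (for example `scipy.stats.f.sf(f, df_b, df_w)`, if SciPy is available).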
Once each of these tests is performed at the specified level
of significance, we have a narrowed-down dataset of only variables that are
statistically related to our y-variable, which brings us one step closer to
figuring out which variables historically have the most influence on our
y-variable and will end up in our final predictive model.
-Caitlin Garrett, Statistical Analyst at Rapid Insight