Tuesday, December 4, 2012

On Automated Mining


One of the things I love the most about using statistical modeling software (especially Analytics) is that so much of the process is automated. Although automation has made the lives of statisticians much easier (calculating individual standard errors by hand would take hours for each variable), it is still important to be familiar with the methods and thinking that go into the variable selection process. One tab that does a lot of statistical heavy lifting for us is the Automated Mining tab, and I thought it would be good to explore some of the tests that are being used in that tab.

The function of the Automated Mining tab is to determine, variable by variable, which variables are statistically related to the selected y-variable, and which are not. The statistical test will vary from pair to pair depending on the types of variables being compared. One thing that is important to note is that we’re not doing any modeling or looking at the relationships between x-variables yet. The Automated Mining tab and its tests are only deciding which variables have the possibility of being in the predictive model, not which ones will be.

Depending on the types of x- and y-variables involved, one of three tests is used to decide how strongly each variable pair is related: a Chi-Square test, a Z-test, or an F-test.


                         Variable Under Evaluation
Y-Variable     Binary    Continuous             Categorical
Binary         Z-Test    Decile-Chi-Square      Z-Test
Continuous     Z-Test    Decile-F-Test/ANOVA    Z-Test
Categorical    n/a       n/a                    n/a
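
For readers who think in code, the table above can be read as a simple lookup from a pair of variable types to a test. This is only an illustrative sketch: the dictionary, the function name `choose_test`, and the lowercase type labels are hypothetical and are not taken from the software itself.

```python
# Hypothetical lookup mirroring the table above.
# Keys are (y_type, x_type); names and labels are illustrative only.
TEST_BY_TYPES = {
    ("binary", "binary"): "Z-Test",
    ("binary", "continuous"): "Decile Chi-Square",
    ("binary", "categorical"): "Z-Test",
    ("continuous", "binary"): "Z-Test",
    ("continuous", "continuous"): "Decile F-Test/ANOVA",
    ("continuous", "categorical"): "Z-Test",
}


def choose_test(y_type, x_type):
    """Return the test name for a (y, x) type pair, or None where the table says n/a."""
    return TEST_BY_TYPES.get((y_type, x_type))


print(choose_test("binary", "continuous"))
print(choose_test("categorical", "binary"))
```

A categorical y-variable has no entry, which matches the n/a row in the table.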

Chi-Square Test
A chi-square test is performed for any continuous x-variable used to predict a binary y-variable. In our Automated Mining tab, the records are sorted by the x-variable and split into 10 deciles, and the test determines whether or not the ‘ones’ are randomly distributed across those deciles. This approach is more robust than a linear correlation, as it captures non-linear relationships as well as relationships that are not well fit by a curve or line.
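
To make the decile idea concrete, here is a minimal sketch of a chi-square statistic computed over deciles. It assumes equal-count deciles formed by rank and a standard Pearson chi-square on the resulting 10 x 2 table of ones and zeros; the function name `decile_chi_square` is hypothetical, and the exact binning the software uses may differ.

```python
def decile_chi_square(x, y, n_bins=10):
    """Pearson chi-square statistic for whether the 1s in y are spread
    evenly across rank-based bins (deciles by default) of x.

    Assumes y contains only 0s and 1s, with at least one of each,
    and that there are at least n_bins observations.
    """
    # Sort row indices by x, then cut into n_bins groups of near-equal size.
    order = sorted(range(len(x)), key=lambda i: x[i])
    n = len(order)
    overall_rate = sum(y) / n
    stat = 0.0
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        size = len(idx)
        ones = sum(y[i] for i in idx)
        # Expected counts if the 1s were randomly distributed across bins.
        exp_ones = size * overall_rate
        exp_zeros = size * (1 - overall_rate)
        stat += (ones - exp_ones) ** 2 / exp_ones
        stat += ((size - ones) - exp_zeros) ** 2 / exp_zeros
    return stat


# Critical value for chi-square with 9 degrees of freedom at the 0.05 level.
CHI2_CRIT_DF9_05 = 16.919

x = list(range(1000))
y_step = [1 if v >= 500 else 0 for v in x]   # 1s concentrated in the top deciles
y_flat = [i % 2 for i in range(1000)]        # 1s spread evenly across deciles

print(decile_chi_square(x, y_step) > CHI2_CRIT_DF9_05)
print(decile_chi_square(x, y_flat) > CHI2_CRIT_DF9_05)
```

With 10 deciles and a binary outcome, the statistic has (10 - 1) x (2 - 1) = 9 degrees of freedom, which is why the 9-degree critical value is used for comparison.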

Z-Test
A Z-test is used for any binary or categorical predictor, regardless of the type of y-variable it is trying to predict. It tests whether any one category differs significantly from all other categories combined in terms of the y-variable.
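
As a rough illustration of comparing one category against all the others, the sketch below runs a pooled two-proportion z-test. It assumes a binary y-variable and a two-sided test; the post does not spell out the exact form of the Z-test, so treat this as one plausible version. The function name `category_vs_rest_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def category_vs_rest_z(categories, y, category):
    """Two-proportion z-test: rate of 1s in `category` vs. all other rows.

    Assumes y contains only 0s and 1s and that both groups are non-empty.
    Returns (z, two-sided p-value).
    """
    in_y = [yi for c, yi in zip(categories, y) if c == category]
    out_y = [yi for c, yi in zip(categories, y) if c != category]
    n1, n2 = len(in_y), len(out_y)
    p1, p2 = sum(in_y) / n1, sum(out_y) / n2
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (sum(in_y) + sum(out_y)) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Category "a" has a 60% rate of 1s; everything else has a 40% rate.
categories = ["a"] * 100 + ["b"] * 100
y = [1] * 60 + [0] * 40 + [1] * 40 + [0] * 60
z, p = category_vs_rest_z(categories, y, "a")
print(round(z, 3), p < 0.05)
```

For a predictor with more than two categories, the same function could be called once per category, each time comparing that category to everything else.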

F-Test
An F-test is used whenever a continuous x-variable is used to predict a continuous y-variable. In our Automated Mining tab, the records are sorted by the x-variable and split into deciles, and an ANOVA test is run on these deciles to determine whether the mean of the y-variable differs significantly from one decile to the next. This is more robust than a linear correlation, as it captures non-linear relationships and those that do not fit a standard curve.
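
The decile ANOVA can be sketched the same way. The code below computes a one-way ANOVA F-statistic across rank-based deciles of x; the function name `decile_f_statistic` is hypothetical and the equal-count binning is an assumption. Turning the statistic into a p-value requires the F distribution with (bins - 1, n - bins) degrees of freedom, which is left to a statistics library.

```python
def decile_f_statistic(x, y, n_bins=10):
    """One-way ANOVA F-statistic for y across rank-based bins of x.

    Assumes at least n_bins observations and some variation in y
    within the bins (otherwise the within-group variance is zero).
    """
    # Sort row indices by x, then cut into n_bins groups of near-equal size.
    order = sorted(range(len(x)), key=lambda i: x[i])
    n = len(order)
    groups = [
        [y[i] for i in order[b * n // n_bins:(b + 1) * n // n_bins]]
        for b in range(n_bins)
    ]
    grand_mean = sum(y) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the bin means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own bin mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (n_bins - 1)
    ms_within = ss_within / (n - n_bins)
    return ms_between / ms_within


# A U-shaped relationship: y is high at both ends of x and low in the middle.
x = list(range(100))
y = [(v - 49.5) ** 2 for v in x]
print(decile_f_statistic(x, y))
```

The U-shaped example is the kind of case the paragraph above describes: the high and low ends of x cancel out in a linear correlation, but the decile means still differ widely, so the F-statistic is large.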

Once each of these tests has been performed at the specified level of significance, we are left with a narrowed-down dataset containing only the variables that are statistically related to our y-variable. That brings us one step closer to figuring out which variables have historically had the most influence on our y-variable and will end up in our final predictive model.

-Caitlin Garrett, Statistical Analyst at Rapid Insight 

