Thursday, August 16, 2012

The Forgotten Tabs: Correlation Analysis

Next up in the Forgotten Tabs series is the Correlation Analysis tab, which provides a correlation coefficient for any two variables in our dataset. To get these values, simply check the boxes next to the variables you’re interested in correlating. The resulting correlation coefficient can be either positive or negative, and as a general rule, if its absolute value is greater than .1 (that is, above +.1 or below -.1), we say that those two variables are significantly correlated. Knowing how different variables are correlated helps us understand variable selection and create more accurate models.
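For readers who want to see what a correlation coefficient looks like outside of the tab, here is a minimal sketch in Python. It assumes the coefficient is the usual Pearson correlation (the post does not say which measure the tab reports), and the student values and the `pearson` helper are made up purely for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical values for five students (illustration only, not real data).
hs_gpa = [2.8, 3.1, 3.4, 3.7, 3.9]
sat_math = [480, 530, 560, 640, 690]

r = pearson(hs_gpa, sat_math)
# Apply the .1 rule of thumb to the absolute value, since r can be negative.
print(f"r = {r:.3f}, above the .1 rule of thumb: {abs(r) > 0.1}")
```

The sign of `r` tells you the direction of the relationship, and its absolute value is what gets compared against the .1 rule of thumb.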

Sometimes a high correlation value can explain why a variable did not make its way into a final model when a similar variable did. An example of this is the correlation between the variables “SAT Math”, “SAT Verbal”, and “HS GPA”. As indicators of student success, you might guess that these variables have a positive correlation – so you would expect that a student with a relatively high HS GPA will, in turn, also have relatively high SAT scores, and vice versa. If we were to build a model that used these variables, however, we would typically get something like the following: 

Here we see an “SAT Math” variable in our final model, but “SAT Verbal” and “HS GPA” are nowhere to be found. 

Looking at these variables in the Correlation Analysis tab will confirm our earlier guess that the variables are correlated, which in turn explains why all three are not included in the model: 

Note that each correlation coefficient is well above the general .1 threshold of significant correlation, meaning that these variables are, in fact, strongly correlated. This correlation is accounted for when we build our predictive models, so that if a change in one generally brings about a change in another, Analytics will pick the stronger predictor of the two and leave the other out.
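To make the idea of "pick the stronger predictor and leave the other out" concrete, here is a rough sketch of one simple way to do it. This is not Analytics' actual selection procedure, which the post does not describe; it is a generic greedy filter, and the `filter_correlated` helper, the `First-Year GPA` outcome variable, and all of the values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def filter_correlated(candidates, target, threshold=0.1):
    """Greedy filter (an illustrative assumption, not the product's algorithm).

    Rank candidates by how strongly they correlate with the target, then keep
    a candidate only if it is not correlated (above the threshold) with any
    variable that has already been kept.
    """
    ranked = sorted(
        candidates,
        key=lambda name: abs(pearson(candidates[name], target)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(abs(pearson(candidates[name], candidates[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept

# Hypothetical values for six students (illustration only, not real data).
candidates = {
    "SAT Math":   [480, 530, 560, 640, 690, 720],
    "SAT Verbal": [500, 510, 590, 610, 700, 680],
    "HS GPA":     [2.8, 3.2, 3.3, 3.8, 3.7, 3.9],
}
first_year_gpa = [2.5, 2.8, 3.0, 3.4, 3.7, 3.8]  # hypothetical outcome

kept = filter_correlated(candidates, first_year_gpa)
print(kept)
```

With this made-up data, “SAT Math” has the strongest relationship with the outcome, and the other two variables are dropped because they are correlated with it, which mirrors the model described above.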

Another thing to check for in a correlation analysis is perfect predictors. If a variable pair has a correlation equal to +1 or -1, you’ll know that those variables are perfect predictors of each other. Some common examples are retention and attrition, and housing deposit and enrollment. These things are perfect predictors of each other because retention is the opposite of attrition (so their correlation is -1), and you typically need to enroll to make a housing deposit. The Correlation Analysis tab can tell you whether you have any perfect predictors; if you do, be sure to take one of the variables in each pair out of the analysis. 
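A check for perfect predictors can be sketched in a few lines of Python. The `perfect_pairs` helper, the column names, and the values below are hypothetical, and the sketch assumes Pearson correlation and columns that are not constant (a constant column would cause a division by zero here).

```python
from itertools import combinations
from math import isclose, sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def perfect_pairs(columns, tol=1e-9):
    """Return (name_a, name_b, sign) for every pair with |r| equal to 1."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        # Compare with a tolerance because floating-point results are inexact.
        if isclose(abs(r), 1.0, abs_tol=tol):
            flagged.append((a, b, round(r)))
    return flagged

# Hypothetical flags for six students (illustration only, not real data).
columns = {
    "retained": [1, 0, 1, 1, 0, 1],
    "attrited": [0, 1, 0, 0, 1, 0],  # exact opposite of "retained"
    "hs_gpa":   [3.1, 2.4, 3.8, 3.0, 2.9, 3.5],
}

flagged = perfect_pairs(columns)
print(flagged)
```

Here the retention and attrition flags come back as a perfect pair with a sign of -1, which is the cue to remove one of them before modeling.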

-Caitlin Garrett, Statistical Analyst at Rapid Insight

1 comment:

  1. Thanks for providing such useful information on the topic of correlation. The basics of this topic are explained in an illustrative manner.