Next up in the ForgottenTabs series is the Correlation Analysis tab, which provides a
correlation coefficient for any two variables in our dataset. To get these
values, simply check the boxes next to the variables you’re interested in
correlating. The resulting correlation coefficient can be either positive or
negative, and as a general rule of thumb, if its absolute value is greater than .1
(that is, above +.1 or below -.1), we say that those two variables are
significantly correlated. Knowing how different variables are correlated helps us
understand variable selection and create more accurate models.
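For readers who want to see what sits behind a correlation coefficient, here is a minimal sketch in plain Python. It assumes the coefficient is the usual Pearson correlation; the student values are invented for illustration, and the names `pearson` and `correlation_pairs` are hypothetical helpers, not part of the Analytics software.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data, invented for illustration only.
students = {
    "SAT Math":   [520, 600, 680, 710, 450, 640],
    "SAT Verbal": [500, 580, 650, 700, 480, 610],
    "HS GPA":     [2.9, 3.3, 3.7, 3.9, 2.6, 3.5],
}

def correlation_pairs(data):
    """Return {(var_a, var_b): r} for every pair of variables in data."""
    return {(a, b): pearson(data[a], data[b]) for a, b in combinations(data, 2)}

for (a, b), r in correlation_pairs(students).items():
    # The .1 cutoff is the rule of thumb from the text; it applies to |r|.
    flag = "above the .1 rule of thumb" if abs(r) > 0.1 else "below the .1 rule of thumb"
    print(f"{a} vs {b}: r = {r:+.3f} ({flag})")
```

Comparing the absolute value of each coefficient against the cutoff is what lets a strong negative correlation count just as much as a strong positive one.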
Sometimes a high
correlation value can explain why a variable may not have made its way into a
final model if a similar variable did. An example of this is the correlation
between the variables “SAT Math”, “SAT Verbal”, and “HS GPA”. As indicators of
student success, you might guess that these variables have a positive
correlation – so you would expect that a student with a relatively high HS GPA
will, in turn, also have relatively high SAT scores, and vice versa. If we were
to build a model that utilized these variables, however, we would typically get
something like the following:
Here we see an “SAT
Math” variable in our final model, but “SAT Verbal” and “HS GPA” are nowhere to
be found.
Looking at these variables in the Correlation Analysis tab will
confirm our earlier guess that the variables are correlated, which in turn
explains why all three are not included in the model:
Note that each
correlation coefficient is well above the general .1 threshold of significant
correlation, meaning that these variables are, in fact, strongly correlated. This
correlation is accounted for when we build our predictive models, so that if a
change in one generally brings about a change in another, Analytics will pick
the stronger predictor of the two and leave the other out.
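The idea of keeping the stronger of several correlated predictors can be sketched roughly as follows. This is not how Analytics actually selects variables, which the post does not describe; it is a simple greedy filter written under stated assumptions. The `drop_redundant` function, the 0.8 redundancy cutoff, the "Campus Visit" flag, and all data values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def drop_redundant(predictors, outcome, threshold=0.8):
    """Greedy sketch: rank predictors by |r| with the outcome, then keep a
    predictor only if it is not highly correlated with one already kept.
    The 0.8 threshold is an assumption chosen for this example."""
    ranked = sorted(
        predictors,
        key=lambda name: abs(pearson(predictors[name], outcome)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(abs(pearson(predictors[name], predictors[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data, invented for illustration only.
predictors = {
    "SAT Math":     [520, 600, 680, 710, 450, 640],
    "SAT Verbal":   [500, 580, 650, 700, 480, 610],
    "HS GPA":       [2.9, 3.3, 3.7, 3.9, 2.6, 3.5],
    "Campus Visit": [1, 0, 0, 1, 0, 1],
}
first_year_gpa = [2.7, 3.1, 3.6, 3.8, 2.5, 3.4]

print(drop_redundant(predictors, first_year_gpa))
```

With this toy data the three academic measures are so tightly correlated that only one of them survives, while the unrelated "Campus Visit" flag is kept alongside it.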
Another thing to
check for in a correlation analysis is perfect predictors. If a variable
pair has a correlation equal to one (or negative one, when the two always move in
opposite directions), you’ll know that those variables are
perfect predictors of each other. Some common examples are retention and
attrition, and housing deposit and enrollment. These things are perfect
predictors of each other because retention is the opposite of attrition, and
you typically need to enroll to make a housing deposit. The Correlation Analysis
tab can tell you whether you have any perfect predictors; if you do, you should
be sure to take one of the variables out of the analysis.
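A check for perfect predictors can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than what the tab does internally: `perfect_pairs` and its tolerance are hypothetical, and the records are invented. It flags any pair whose correlation is within a tiny tolerance of +1 or -1, since opposites such as retention and attrition correlate at -1.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def perfect_pairs(data, tol=1e-9):
    """Return variable pairs whose |r| is 1 up to floating-point tolerance."""
    return [
        (a, b)
        for a, b in combinations(data, 2)
        if abs(abs(pearson(data[a], data[b])) - 1.0) < tol
    ]

# Hypothetical toy data: 1 = yes, 0 = no. Attrited is the exact opposite of Retained.
records = {
    "Retained": [1, 0, 1, 1, 0, 1],
    "Attrited": [0, 1, 0, 0, 1, 0],
    "HS GPA":   [3.4, 2.8, 3.7, 3.1, 3.0, 3.9],
}

for a, b in perfect_pairs(records):
    print(f"{a} and {b} are perfect predictors of each other; drop one.")
```

Here only the Retained and Attrited pair is flagged, which is the signal to remove one of the two before modeling.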
-Caitlin Garrett, Statistical Analyst at Rapid Insight
Thanks for providing such useful information on the topic of correlation. The basics of this topic are explained in an illustrative manner.