Friday, July 26, 2013

Predicting Retention for Online Students: Where to Start

With the rise of enrollment in online programs and MOOCs, we’re seeing more and more students forego traditional classroom experiences in favor of more flexible online programs. With this shift comes a whole new set of guidelines for enrollment management, financial aid, and retention programs. Retention, in particular, has seen a significant downward trend as learning moves from in-person to online classrooms.

My interest lies in figuring out which variables might be worth including in an analysis attempting to predict online student retention. I did a bit of research hoping to find a list of variables that had worked in the past, but couldn’t find any comprehensive resource, so I’ve started to build my own. In the sections below, I’ve listed the types of information that I think would be worth analyzing, broken out into four separate categories. Some of these are variables in and of themselves, and some can be broken down in different ways; for example, “age” can be used by itself, but creating a “non-traditional age” flag is useful as well. Realistically, not all schools will have all of this information, so this list is meant to be a good starting point of what to shoot for when collecting data.
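As a quick illustration of deriving a flag from a raw variable, here is a minimal sketch in Python with pandas. The column names and the age-25 cutoff are assumptions for illustration; use whatever definition of “non-traditional” fits your institution.

```python
import pandas as pd

# Hypothetical student records; column names are assumptions for illustration.
students = pd.DataFrame({
    "student_id": [101, 102, 103],
    "age": [18, 24, 31],
})

# Flag students aged 25 or older as non-traditional
# (a common cutoff; adjust to your institution's definition).
students["non_traditional"] = (students["age"] >= 25).astype(int)
```

Both the raw age and the derived flag can then go into the model dataset, and you can let the modeling process tell you which carries more signal.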

Also, if you have any variables to add (and I’m sure there are some I’ve missed), I’d love to hear about them in the comments. 

Student Demographic Information
  • Socioeconomic status / financial aid information
    • FAFSA info, Pell eligibility, any scholarship or award info
  • Ethnicity
    • Minority Status
  • Gender
  • Home state
  • Distance from physical campus (if applicable)
  • Age; traditional or non-traditional?
  • Military background?
  • Have children?
  • Currently employed full-time?
  • First generation college student?
  • Legacy student? (Did a parent/grandparent/sibling attend?)

Student Online Learning History
  • Registered for classes online or in person?
  • How many days did they register before the start of the term?
  • Ever attended a class on-campus?
  • Do they plan to attend both online and on-campus classes?
  • Did they attend any type of orientation?
  • Number of previous online courses taken
    • First-time online learner?

Student Academic History
  • GPA
  • SAT/ACT scores
  • Degree hours completed
  • Degree hours attempted
  • Taking developmental courses?
  • Transfer student?
  • Degree program / major 
  • Program level (Associate, Bachelors, Masters, etc.)
  • Number of program or major changes (if applicable)
  • Any previous degrees?

Course- and Program-Related
  • Amount of text vs. interactive content 
  • Lessons with immediate feedback?
  • Any peer-to-peer forum for interaction?
  • Lessons in real time or recorded?
  • Amount of teacher interaction with students
    • Chat, email exchange, turn-around time on assignments

Closing notes:

Getting course-related data might be difficult, but the variables I listed above are drawn from studies on improving online courses, which identify them as areas to focus on. My thinking is that the more engaged a student is, both with peers and instructors, the better their chances of online success. If you have the data available, it would be worth incorporating it into your model dataset to see whether it is predictive.

Rather than using retention as a y-variable when building these models, we typically create an attrition variable (exactly the opposite of retention) and use that as our y instead. This way, we're getting more directly at the characteristics of a student who is likely to leave rather than stay.
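The flip from retention to attrition is a one-line transformation. Here is a minimal sketch, assuming a hypothetical dataset with a binary `retained` column and two illustrative predictors (the column names and values are made up; the post doesn’t prescribe a specific algorithm, so logistic regression stands in as one common choice):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical dataset: 'retained' = 1 if the student returned the next term.
df = pd.DataFrame({
    "gpa":          [3.2, 2.1, 3.8, 1.9, 2.7, 3.5],
    "prior_online": [2, 0, 3, 0, 1, 4],   # previous online courses taken
    "retained":     [1, 0, 1, 0, 1, 1],
})

# Flip retention into attrition so the model targets students likely to leave.
df["attrition"] = 1 - df["retained"]

X = df[["gpa", "prior_online"]]
y = df["attrition"]
model = LogisticRegression().fit(X, y)
```

With attrition as the target, a high predicted probability directly flags a student at risk of leaving, which is the case you want to intervene on.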

Typically when building attrition models, I create separate models for freshmen and upperclassmen. I’d suggest doing that here as well, since previous online coursework will probably be a good predictor of success in future online coursework, and first-term freshmen won’t have any such history. In that case, you’d want to take many of the variables listed above out of the freshman retention model.
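Splitting the dataset this way can be sketched as follows, again with hypothetical column names (`class_level`, `prior_online`, etc. are assumptions for illustration); the idea is simply that the freshman model drops the online-history variables that freshmen don’t have:

```python
import pandas as pd

# Hypothetical dataset with a class-level column ('FR' = freshman).
df = pd.DataFrame({
    "class_level":  ["FR", "SO", "FR", "JR"],
    "gpa":          [3.0, 2.5, 3.6, 2.9],
    "prior_online": [0, 2, 0, 5],  # undefined/zero for first-term freshmen
    "attrition":    [0, 1, 0, 1],
})

# Separate modeling populations: freshmen lack an online-learning history,
# so drop those variables from their model dataset.
freshmen      = df[df["class_level"] == "FR"].drop(columns=["prior_online"])
upperclassmen = df[df["class_level"] != "FR"]
```

Each subset then gets its own model, so the upperclassman model can lean on the history variables while the freshman model relies on demographics and academic background.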

Finally, it’s important to keep in mind that student success has different meanings for different institutions. You could be basing success on the number of credits completed, transitions from semester to semester, or a particular GPA cutoff, among other indicators. When building these different types of student success models, you will probably need to tailor some of these variables to fit the model you're building.

-Caitlin Garrett is a Statistical Analyst at Rapid Insight


  1. Hi, Caitlin.

    Under Course-Related, you may want to include:

    1) How many times has the course been offered previously?
    2) What was the attrition rate the last time the course was offered?

    These two factors might well account for 90% of the variation!


  2. Caitlin:

    I don't know if you are interested in semi-real time (semi because the lag could be weeks or more) analytics or not, but some of the best predictors of student success in MOOCs are variables like score on the first quiz, amount of participation in the first week, etc. The primary use for retention predictions (I would assume) is to inform interventions. If you look at a MOOC active student curve, it tends to sharply drop off within the first few weeks, and often after the first week. Adding in predictors as the data becomes available would help you target those students before and as they drop off.


    P.S. The internet is a very small place, Bruce, nice to see you are interested in this issue.

  3. Bruce,

    Those are both great suggestions. Thanks for your input!


  4. Vik,

    I think that depending on how sophisticated an institution gets with predictive modeling, doing a real-time analysis can be very useful. The variables you mentioned are great. Thanks for sharing!


  5. Caitlin, I would recommend adding Home School yes/no. Sometimes there is missing HS GPA and ACT/SAT and a HS flag can help.