This next post comes from Jeff Fleischer, our Director of Client Operations, support wiz, and analyst extraordinaire:
Working out the logic of a new variable you want to create with a TRANSFORM node can be challenging. But when missing data ("nulls") get into the mix, it can be especially confusing and frustrating. For example, if you'd written the conditional formula...
IF ([A]='Freshman', 'UG', 'Grad')
...and some of the fields under column [A] were null, you would get nulls as an output for those rows rather than the desired 'Grad'. This is because trying to equate something with "nothing" confuses Veera as to what you would really want as a result. So here are some suggestions on how best to deal with those gaps and still get to the outcome you need...
1.
The most obvious way to deal with gaps in data is to replace them with something. This may not always be desirable, but when it is, using a CLEANSE ahead of your TRANSFORM is your best bet. Select the "Is Missing" operator and use Alt-Left Mouse to select all the columns whose missing fields should be filled in with a new value, like 'unknown'.
Of course, you could instead place a CLEANSE after your TRANSFORM, using it to fill in any missing values appearing in the new column.
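If it helps to see the equivalent logic in code, here's a minimal pandas sketch of the cleanse-then-transform pattern (the column names and values are hypothetical, not from Veera itself):
import pandas as pd
df = pd.DataFrame({'A': ['Freshman', None, 'Senior']})
# CLEANSE step: fill missing values before the transform sees them
df['A'] = df['A'].fillna('unknown')
# TRANSFORM step: the conditional now has no nulls to trip over
df['Level'] = df['A'].apply(lambda a: 'UG' if a == 'Freshman' else 'Grad')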
2.
If filling in those data holes using a cleanse is not preferable, maybe just a temporary patch will do. Look for the "Treat Missings in Formula as Zeros" checkbox just above the "New Variable Name" field in the TRANSFORM. Just as the name suggests, this will temporarily replace any missing data with a zero, allowing most operations to function. Be careful, though, if the column you're evaluating already contains zeros - the output may not be what you intended!
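In code terms, the checkbox behaves roughly like a temporary fill with zero, which is also where the caveat about existing zeros comes from. A sketch with hypothetical column names:
import pandas as pd
df = pd.DataFrame({'Credits': [12, None, 0]})
# Rough analogue of "Treat Missings in Formula as Zeros":
# substitute 0 only for the duration of the calculation
df['FullTime'] = (df['Credits'].fillna(0) >= 12)
# Caveat: a true 0 and a former null now evaluate identically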
3.
If even temporarily replacing nulls with something else isn't an option, then change your TRANSFORM formula to deal with them ahead of everything else. To do this, you'll likely need to use one of two built-in Veera functions - IS NULL or IS NOT NULL. We might change our example to include another condition, such as...
IF ([A] IS NULL, 'Withdrawn',
IF ([A]='Freshman', 'UG', 'Grad'))
The idea here is to catch any nulls before they affect the rest of the logic by putting that condition first.
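For comparison, the same null-first pattern sketched in pandas/numpy (column and output names are hypothetical):
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['Freshman', None, 'Senior']})
# Catch nulls first, then apply the original logic
df['Status'] = np.where(df['A'].isna(), 'Withdrawn',
               np.where(df['A'] == 'Freshman', 'UG', 'Grad'))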
4.
Finally, another (if more specialized) option might be to use the "Missings:" TRANSFORM feature. Unlike the "Treat Missings in Formula as Zeros" checkbox, this control changes nulls that appear as the final result of a formula. The replacement options offered by this feature are limited (0 or 1), but it may be an easy way to fix a problem with absent data appearing in a new numeric field.
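In effect, this is a fill applied to the formula's output rather than its inputs. Roughly, with a hypothetical column name:
import pandas as pd
df = pd.DataFrame({'NewVar': [1.5, None, 2.0]})
# "Missings:" analogue: replace nulls in the finished result, not the inputs
df['NewVar'] = df['NewVar'].fillna(0)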
-Jeff Fleischer
Tuesday, January 22, 2013
Wednesday, January 16, 2013
How to Interpret a Decile Analysis
After building a predictive model, there are several ways to determine how well it describes your data. One visual way to gauge model fit is the decile analysis. Here we’ll take a look at what a decile analysis represents, how it’s created, and how to spot a good model.
What a Decile Analysis Represents
After a statistical model is built, a decile analysis is created to test the model’s ability to predict the intended outcome. Each column in the decile analysis chart represents a collection of records that have been scored using the model. The height of each column represents the average of those records’ actual behavior.
How the Decile Analysis is Calculated
1. The hold-out or validation sample is scored according to the
model being tested.
2. The records are sorted by their predicted scores in descending
order and divided into ten equal-sized bins or deciles. The top decile contains
the 10% of the population most likely to respond and the bottom decile contains
the 10% of the population least likely to respond, based on the model scores.
3. The deciles and their actual response rates are graphed on
the x and y axes, respectively.
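For the code-minded, here is a minimal sketch of those three steps in pandas, using synthetic data in place of real model scores:
import numpy as np
import pandas as pd
rng = np.random.default_rng(0)
score = rng.random(1000)                          # stand-in for model scores
actual = (rng.random(1000) < score).astype(int)   # observed 0/1 responses
df = pd.DataFrame({'score': score, 'actual': actual})
# Step 2: rank descending so decile 1 holds the highest-scoring 10%
df['decile'] = pd.qcut(df['score'].rank(ascending=False, method='first'),
                       10, labels=range(1, 11))
# Step 3: actual response rate per decile -- the bar heights on the chart
print(df.groupby('decile')['actual'].mean())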
After the decile analysis is built, you’ll want to take a
look at the height of the bars in relation to one another. Deciding whether a
model is worth moving forward with depends on the pattern you see when viewing
the decile analysis.
Ideal Situation: The Staircase Effect
When you’re looking at a decile analysis, you want to see a
staircase effect; that is, you’ll want the bars to descend in order from left
to right, as shown below.
This is telling you that the model is “binning” your
constituents correctly from most likely to respond to least likely to respond. A
model exhibiting a good staircase decile analysis is one you can consider
moving forward with.
Not-So-Ideal Situations
In contrast, if the
bars seem to be out of order (as shown below), the decile analysis is telling
you that the model is not doing a very good job of predicting actual responses.
If the bars seem to
be the same height, or the decile analysis looks “flat”, the decile analysis is
telling you that the model isn’t performing any better than randomly binning
people into deciles would. In both cases, your model should be improved before
moving forward with it.
-Caitlin Garrett, Statistical Analyst at Rapid Insight
Thursday, January 10, 2013
Valuing Analytics & Predictive Modeling in Higher Ed

Where does predictive modeling fit into the analytic ecosystem in higher education?
Within the analytic ecosystem in higher ed, there is a range of ways in which data is analyzed. At one end of the spectrum you have historical reporting, which our clients do a lot of and which is vital to every institution. Somewhere in the middle is data exploration and analysis, where you’re slicing and dicing data to understand it better or to make more informed decisions based on what happened in the past. At the other end of the spectrum is predictive modeling. Modeling requires looking at all of the variables in a given set of information to make informed predictions about what will happen in the future. What is each applicant’s probability of enrolling, or what is each student’s attrition likelihood?
What will the incoming class look like based on the current admit pool?
These are the types of questions that are being answered in higher ed with
predictive analytics. The resulting probabilities can also be used in the
aggregate. For example, enrollment models allow you to predict overall
enrollment, enrollment by gender, by program, or by any other factor. The models are also used to project financial
outlay based on the financial aid promised to admitted applicants and their
individual enrollment probabilities.
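Used in the aggregate, the math is simply a sum of individual probabilities. A minimal sketch, with hypothetical column names and numbers:
import pandas as pd
df = pd.DataFrame({'enroll_prob': [0.8, 0.3, 0.5],        # per-applicant model scores
                   'aid_offered': [12000, 8000, 10000]})  # promised financial aid
expected_class_size = df['enroll_prob'].sum()
expected_aid_outlay = (df['enroll_prob'] * df['aid_offered']).sum()
print(expected_class_size, expected_aid_outlay)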
Higher education has come a long way in the last five to ten
years in its use of predictive analytics. The entire student life cycle is now
being modeled, starting with prospect and inquiry modeling all the way through to alumni donor modeling. It used to be that institutions doing this kind of modeling relied on outside consulting companies. Today most are doing their modeling in-house. Colleges and universities view their data as a strategic asset, and they are extracting value from it with the same tools and methodologies as Fortune 500 companies.
What kinds of resources are needed, and what is the first step for an institution that wants to become more data-driven in its decision making?
It’s important to have somebody who knows the data. As long as a user has an understanding of their data, our software makes it easy to analyze that data and build predictive models quickly. And our support team is available to answer any analytic questions.
Gaining access to the data is the first step. We see a lot of institutions that have reporting tools which don’t allow them to ask new questions of the data. So, they might have a set of 50 reports that they’re able to run over and over, but anytime someone has a new question, without access to the raw data there’s no way to answer it.
It really helps if the institution is committed to a culture of data-driven decision making. Then all the various stakeholders are more focused on ensuring data access for those doing the predictive modeling.
What do you say to those who are on “the quest for perfect data”? Is it okay to implement predictive analytics before you have that data warehouse or those perfectly cleansed datasets?
No institution is ever going to have perfect data, so you work
with what you have. We suggest seeing what you have, finding any obvious
problems in the data, and then fixing those problems the best you can. We’ve
designed our solutions such that a data warehouse is not required but, even
with a clean data warehouse, the data is never going to be perfect. As long as you have an understanding of the data, you can move forward.
In your experience, which models in higher education produce the highest ROI?
We have a customer, Paul Smith’s College, that has quantified their retention modeling efforts. Using their model results, they put programs into place to help those students predicted to be at high risk of attrition. They credit the modeling with helping them identify which students to focus on, saving them $3m in net tuition revenue so far.
We have other clients that are using predictive modeling on the prospect side, and they’re realizing significant savings on their recruiting efforts. Instead of mailing to 200,000 high school seniors, they’re mailing to 50,000, saving money by not mailing and not calling those students who have pretty much zero probability of applying or enrolling.
Although not as easily quantifiable, enrollment modeling has a pretty big ROI, not only in determining which applicants are likely to enroll, but in predicting class size. If an institution overshoots and enrolls too many applicants, they’ll have dorm, classroom, and other resource issues. If they enroll too few, they’ll have revenue issues. So predicting class
size and determining who and how many applicants to admit is extremely
important.
What are some common mistakes you see when approaching predictive modeling for your higher ed customers?
One mistake that I often see is when information is thrown
out as not useful to the models. Zip code is a good example. Zip
code looks like a five digit numeric variable, but you wouldn’t want to use it
as a numeric variable in a model. In some cases it can be used
categorically to help identify applicants’ origins, but its most useful purpose is for calculating a distance-from-campus variable. This is a variable that we see showing up as a predictor in many prospect/inquiry models,
enrollment models, alumni models, and even retention models. Another
example of a variable that is often overlooked is application date.
Application date often contains a ton of useful information if looked at
correctly. It can be used to calculate the number of days between when the
application was sent and the application deadline. This piece of
information can tell you a lot about an applicant’s intentions. A student
who gets their application in the day before the deadline probably has very
different intentions than a student who applies nine months before the
deadline. This variable ends up participating in many models.
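As an illustration, here is a minimal sketch of deriving both variables, assuming zip codes have already been joined to latitude/longitude centroids (all column names, dates, and campus coordinates here are hypothetical):
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'app_date': pd.to_datetime(['2012-04-30', '2011-09-01']),
    'zip_lat': [42.36, 40.71],
    'zip_lon': [-71.06, -74.01],
})
# Days between application and a hypothetical deadline
deadline = pd.Timestamp('2012-05-01')
df['days_before_deadline'] = (deadline - df['app_date']).dt.days
# Haversine distance from campus, in miles (campus coordinates assumed)
campus_lat, campus_lon = np.radians(44.05), np.radians(-71.13)
lat, lon = np.radians(df['zip_lat']), np.radians(df['zip_lon'])
a = (np.sin((lat - campus_lat) / 2) ** 2
     + np.cos(campus_lat) * np.cos(lat) * np.sin((lon - campus_lon) / 2) ** 2)
df['miles_from_campus'] = 2 * 3959 * np.arcsin(np.sqrt(a))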
To get our customers up to speed on best practices in
predictive modeling we’ve created resources like lists of recommended variables
for specific models and guides on how to create useful new variables from existing
data.
Labels:
analysis,
interview,
Mike Laracy,
predictive analytics,
predictive modeling,
ROI
Tuesday, January 8, 2013
Defining Rapid Insight
I recently had the opportunity to sit down with Mike Laracy,
President and CEO of Rapid Insight, to ask him a few questions about analytics
in higher education, predictive modeling, and Rapid Insight. I’ll be posting
the interview as a two part series here on the blog (with part two located here). The first part is the
story of Rapid Insight – how it started, what we do, and where we’re going –
enjoy!
Rapid Insight has been around since 2002. Can you tell us a bit of the story on how the company came to be?

I had been living in Boulder, Colorado when I developed the
concept of Rapid Insight. I spent a lot
of time thinking through the predictive modeling process and figuring out how
it could be automated and streamlined. I
sat on the concept for a couple of years before actually starting the company.
In 2002 I moved here to North Conway and decided to rent some office space to start developing the concept of Rapid Insight into an
actual software product. For the first
six months it was just me. I spent that
time writing the algorithms and developing a working prototype. I wasn’t a programmer and I knew that to turn
the software into a commercial application, I’d need more help. I hired a software developer who is still
with the company today as our lead engineer.
A year later we hired another developer.
In 2006 we hired our first salesperson, launched Rapid Insight
Analytics, and we’ve been growing ever since.
Do your products focus exclusively on predictive analytics?
Our products also focus on ad hoc analysis and
reporting. In 2008, we launched our
second product called Veera. Whereas
Rapid Insight Analytics automates and streamlines the process of predictive
modeling and analysis, Veera focuses on the data. Data is typically scattered between
databases, text files and spreadsheets, with no easy way to organize it and
piece it together for modeling and analysis.
Veera solves that problem. It’s a data-agnostic technology that allows access to any database and any file format
and makes it easy for people to integrate, cleanse, and organize their data for
modeling, reporting, or simply ad hoc analysis.
We initially developed this technology as a tool to organize
data for predictive modeling. We’re now
seeing enormous demand for the tool as a standalone technology as well. Colleges and universities use it for
reporting and ad hoc analysis. Companies
like Choice Hotels and Amgen use it for processing analytic datasets with data
coming from disparate sources. Healthcare
organizations are using it for reporting and performing ad hoc analyses on
their databases. Defense contractors are
using it for cyber security.
What makes your company different from others working in the higher ed space?
In higher ed there are consulting companies that provide predictive
modeling services. You send them your
data, and they build a model and send you back the model and a report. But the institution still has to do the prep
work to create the analytic file, which is 90% of the effort. This process is both expensive and time-consuming,
and the knowledge gained from the analysis isn’t always transferred back. By
bringing predictive modeling in-house, changes can be made on the fly without
having to send data anywhere and models can be changed and updated very
quickly, which is important because modeling is such an iterative process.
We provide schools with a means of doing this analysis and
building their own models. One
advantage is that the knowledge is always captured internally. But the biggest advantage is that institutions are able to ask questions of their data and answer them on the fly.
As far as other software products that are being used in
higher ed, we’re very different from tools like SAS or SPSS in that the users don’t
need to be programmers or statisticians to build models using our tools. I think if you asked our customers, you’d find that one of our biggest differentiators from these types
of products is our customer support. Our
analysts are available to help our clients with any questions as they build
models, analyze data, or create reports.
Whether the questions pertain to using our technology or about
interpreting the results, we are always available to help. We want to ensure that our customers grow
their own analytic sustainability.
...click here for Part Two, where Mike shares more about predictive modeling in higher education.
Monday, January 7, 2013
Thoughts from a Registrar
Dan Wilson, Registrar at Muskingum University, recently talked with us about some of the reports he's been working on, how he's using Veera, and his upcoming webinar.
CG - What types of reports are usually on your plate?
DW - Some of the reports I'll be talking about in the webinar include:
- historical registrations by date,
- historical majors by semester,
- number of students still needing to take specific general education courses,
- graduation persistence by major,
- retention rates by various factors, and
- DWF (drop/withdraw/fail) rates by course.
I have one report for each of these and I make minor adjustments to it each time a new question is asked.
CG - How has Veera helped with your reporting?
DW - It's helped me develop complex reports that would have taken me 4-6 hours each to get all the data, build, and run. Now I pull up a report and run it in about a minute. It's especially useful for complicated and repetitious reports.
All of the reports I've mentioned have been automated in Veera. Everything that I can, I automate. I anticipate that people might be asking for DWF rate for the first year students, or by division, or by course, or by phase of the moon. Veera is good at pulling that data together and allowing me to tweak it and make adjustments as needed. I usually choose to work with Veera when I think I'll see a lot of revisions, need to do some digging around, or can see similar questions being framed differently.
The year-over-year registrations by date report was one of the first I created using Veera. It's proven to be one of the most valuable to our administration in helping improve our retention rates. It literally takes two minutes to run - it actually takes longer to download the data file than it does to run the report in Veera.
CG - What do you hope attendees will take away from your webinar?
DW - I hope each person will find something that is a spark moment for them. Some people need exposure to Veera and to see what it can do. For users, I'm hoping to spark a brainstorm on how and why to use Veera - as a way to achieve what they need faster and more easily.
CG- Anything else you'd like to add?
DW - Anyone who knows what a small college registrar does will understand that I wear many, many hats. Though I'm not an institutional researcher, some of the work that I do is borderline IR. Typically I'm asked a question and need to get someone the answer quickly. That's what I use Veera for the most.
For more information or to register for Dan's upcoming webinar, Digging Deep into Data: How a Small University's Registrar Develops Complex and Repeatable Mission-Critical reports, click here.
Labels:
customer webinar,
customers,
Dan Wilson,
interview,
Muskingum University