Monday, December 31, 2012

Customer Tips: Roundup Edition

Here are a few extra tips for a roundup edition of our Customer Tips series to get your new year off to a good start!

Sometimes it's helpful in an analysis to recode variables, like recoding a binary variable for retention to a binary variable for attrition. 
- Jean Constable, Director of Institutional Research at Texas Lutheran University
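The recode itself is simple wherever you do it. As a hypothetical sketch in Python (the field names here are invented for illustration, not taken from any particular dataset):

```python
# Hypothetical sketch: flip a 0/1 retention flag into an attrition flag.
# The field name "retained" is invented for illustration.
def recode_attrition(retained):
    """Return 1 if the student left (attrition), 0 if retained."""
    return 1 - retained

rows = [{"id": 1, "retained": 1}, {"id": 2, "retained": 0}]
for row in rows:
    row["attrition"] = recode_attrition(row["retained"])
# row 1 -> attrition 0, row 2 -> attrition 1
```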

Problem: a new input file every two weeks needs to be processed by the same job time after time.
Solution: Use a generic file name for the file in the input node. Copy the weekly input file to the generic file name and run the job. This works as long as the input files have an identical format. 
- William Anderson, CIO at Saint Michael's College
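Outside of Veera, the copy step in this workflow could be scripted in a few lines. A minimal sketch in Python (the file names are invented for illustration):

```python
# Sketch of the copy step, assuming the job always reads a generic
# file name such as "input.csv". Paths are invented for illustration.
import shutil

def stage_weekly_file(weekly_path, generic_path="input.csv"):
    """Copy this week's input file over the generic name the job expects."""
    shutil.copyfile(weekly_path, generic_path)
```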

If you place a filter between your input and the merge, even an unpopulated one, you won't lose the merge connections or fields when you change the input file!
- Loralyn Taylor, Registrar and Director of Institutional Research at Paul Smith's College

Wednesday, December 19, 2012

Rapid Insight's Holiday Wishlist

As the holiday season is in full swing, we at Rapid Insight have taken the opportunity to put together a wishlist of things we’d like to see more of in the future.

We’re envisioning a world where…

…there are no hidden network firewalls – and no need for them.  – Jeff Fleischer, Director of Client Operations

...we get to eat more Lindt chocolate. – Tricia Mills, Account Management Team

…our customers have the budget to purchase the tools and hire the staff to properly serve their student population.  – John Paiva, Account Management Team

…nobody is burdened with clunky tools like SAS and SPSS. – Mike Laracy, President and CEO

…data comes perfectly cleansed and ready for model building. – Caitlin Garrett, Statistical Analyst

…data analysts are fearless in their pursuit of using data to drive good decisions. – Sheryl Kovalik, Director of Operations and Business Development

…more people are using Rapid Insight. – Chris Major, Sales Team

…there’s more candy! – Julie Crawford, Account Management Team

Now that you have our wishlist, we'd love to know: what's on yours?

Wednesday, December 12, 2012

Thoughts from a Reporting Wiz

Scott Alessandro of the MIT Sloan School of Management is a lover of ad-hoc reporting and coffee ice cream.  In anticipation of his webinar on Friday, we asked him a few quick questions about his day-to-day analytic life. 

CG - What types of analytic requests do you handle?

SA - Some ad-hoc requests and some internal reports for my office, including degree requirement completion, grade distribution reports, enrollments by programs, GPA comparisons among courses or programs, impact of student population on enrollment/availability, etc. 

CG - What is your typical response time?

SA - Much faster now [with Veera] than beforehand. In the past, it would take me at least a couple of hours of unbroken time to create a report - which means a while. With Veera, unbroken time doesn't make a bit of difference. 

CG - Have you seen your decision-making become more data-driven with Veera?

SA - Most definitely. I hoped that it was always data-driven, but now because I have such easy access to data, it allows me to answer more questions, or anticipate more questions. 

CG - What do you hope attendees will take away from your webinar?

SA - That we have a lot of data and the problem was not having the time to use it or go through it. That's what Veera allows us to do. Since it's a visual tool, it becomes that much more accessible for people who are not as data-inclined. 

CG - Anything else you'd like to add?

SA - When I have Veera on at home, even my kids are impressed. It looks really neat. There's something artistic about it and that's why I like it. 


We are pleased to present Scott's webinar, From Data to Decisions: Ad Hoc Analytics and Reporting with Rapid Insight Veera, on how he is utilizing Veera efficiently to respond to the wide range of analytic demands confronting him daily. The webinar will take place on Friday, December 14th from 11am - 12pm EST. 

For more information about Scott's webinar, or to register, click here

For more information about Scott, read on:

Scott Alessandro is the Associate Director of Educational Services at MIT Sloan School of Management. His main responsibilities entail overseeing the Registration Team, managing MIT Sloan’s course bidding system, and reacting to various data requests. In previous lives, he has worked at Boston University (running summer pre-college programs), Temple University (in the Honors Program and the Undergraduate Admissions Office), and the College Board (coordinating AP workshops and data reporting). All of his jobs have combined numbers and people, which has made Scott quantifiably and qualitatively happy. Outside of work, Scott satisfies his curiosity and finds entertainment in hiking, woodworking, playing sports, watching sports (though he has not found as much happiness lately rooting for the Chicago Bears and New York Mets), and trying to stay one step ahead of his two young children. 

Tuesday, December 11, 2012

Five Data Preparation Mistakes (and How to Avoid Them!)

After building many predictive models in the Rapid Insight office and helping our customers build many more outside of it, we have a list of data preparation mistakes that could fill a room. Here are some of the most common ones we've seen:

1. Including ID Fields as Predictors
Because most IDs look like continuous integers (and older IDs are typically smaller), they may make their way into the model as predictive variables. Be sure to exclude them as early in the process as possible to avoid any confusion while building your model.
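One simple safeguard is to screen column names for ID-like fields before modeling. A minimal sketch in Python (the column names and the naming pattern are invented; adjust to your own scheme):

```python
# Sketch: exclude ID-like fields before modeling.
# Column names are invented for illustration; adapt the pattern
# to however your own systems name identifier fields.
def drop_id_fields(columns):
    """Return the columns that do not look like record identifiers."""
    return [c for c in columns if not c.lower().endswith("id")]

cols = ["student_id", "gpa", "credits", "record_id"]
predictors = drop_id_fields(cols)  # ["gpa", "credits"]
```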

2. Using Anachronistic Variables
Make sure that no predictor variables contain information about the outcome. Because models are built using historical data, some of the variables accessible to you now may not have been available at the point in time the model is meant to reflect. No predictor variable should be a proxy for your dependent variable (e.g., “made a gift” = donor, “deposited” = enrolled).

3. Allowing Duplicate Records
Don’t include duplicates in a model file. Including just two records per person gives that person twice as much predictive power. To make sure that each person’s influence counts equally, only one record per person or action being modeled should be included. It never hurts to dedupe your model file before you start building a predictive model. 
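The dedupe step can be as simple as keeping the first record seen per person. A hypothetical sketch in Python (field names are invented for illustration):

```python
# Sketch: keep one record per person before modeling.
# The field names are invented for illustration.
def dedupe(records, key="person_id"):
    """Keep the first record seen for each key value."""
    seen, unique = set(), []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique

records = [{"person_id": 1, "gave": 1},
           {"person_id": 1, "gave": 1},   # duplicate: would double this person's weight
           {"person_id": 2, "gave": 0}]
model_file = dedupe(records)              # two records, one per person
```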

4. Modeling on Too Small of a Population
Double-check your population size. A good goal to shoot for in a modeling dataset is at least 1,000 records spanning three years. Including at least three years helps to account for any year-to-year fluctuations in your dataset. The larger your population, the more robust your model will be. 

5. Not Accounting for Outliers and/or Missing Values
Be sure to account for any outliers and/or missing values. Large rifts in individual variables can add up when you’re combining those variables to build a predictive model. Checking the minimum and maximum values for each variable can be a quick way to spot any records that are out of the usual realm. 
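A min/max/missing scan per variable is quick to script. A minimal sketch in Python (the values are invented for illustration):

```python
# Sketch: quick min/max/missing scan for one variable.
# The example values are invented for illustration.
def profile(values):
    """Return (min, max, missing count) for one variable, ignoring None."""
    present = [v for v in values if v is not None]
    return min(present), max(present), len(values) - len(present)

ages = [18, 19, 22, None, 240]      # 240 is a likely data-entry error
lo, hi, missing = profile(ages)     # (18, 240, 1) -- the max flags the outlier
```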


-Caitlin Garrett, Statistical Analyst at Rapid Insight

Thursday, December 6, 2012

Guide to Rapid Insight Resources

We have created several opportunities for networking with other Rapid Insight users, including a Rapid Insight LinkedIn customers-only group, and several more subject-specific subgroups.

Check out our list of upcoming webinars here. These are a few you might want to check out:

  • Predictive Modeling (PM) for Higher Ed
  • PM for Fundraising
  • Dashboards and Reporting for Higher Ed
  • PM for Healthcare

Training Resources:

Here are links to a couple of recently featured series:
  • Customer Tips: tips from customers on ways to make your life easier.
  • Creating Variables: on how and why to augment your dataset by creating additional variables using Veera.
  • The Forgotten Tabs: on the benefits of utilizing some of Analytics’ lesser talked about tabs.

Rapid Insight is proud to host an annual User Conference each summer. Information about the conference will be available on our User Conference page as the conference draws near. 

If you have any additional questions about Rapid Insight resources or products, please feel free to contact me directly at

-Caitlin Garrett, Statistical Analyst

Tuesday, December 4, 2012

On Automated Mining

One of the things I love the most about using statistical modeling software (especially Analytics) is that so much of the process is automated. Although automation has made the lives of statisticians much easier (calculating individual standard errors by hand would take hours for each variable), it is still important to be familiar with the methods and thinking that go into the variable selection process. One tab that does a lot of statistical heavy lifting for us is the Automated Mining tab, and I thought it would be good to explore some of the tests that are being used in that tab.

The function of the Automated Mining tab is to determine, variable by variable, which variables are statistically related to the selected y-variable, and which are not. The statistical test will vary from pair to pair depending on the types of variables being compared. One thing that is important to note is that we’re not doing any modeling or looking at the relationships between x-variables yet. The Automated Mining tab and its tests are only deciding which variables have the possibility of being in the predictive model, not which ones will be.

Depending on the types of x- and y-variables involved, one of three tests will be used to decide how related each variable pair is. These possibilities include a Chi-Square test, a Z-test, or an F-test. 

[Table: test applied for each combination of x- and y-variable types under evaluation]

Chi-Square Test
A chi-square test is performed for any continuous x-variables used to predict a binary y-variable. In our Automated Mining tab, this test is performed on each of 10 deciles to determine whether or not the ‘ones’ are randomly distributed across the deciles. This test is more robust than using a linear correlation, as it captures non-linear relationships as well as relationships that are not well fit by a curve or line.
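The idea can be sketched with scipy (a minimal illustration of the decile approach, not necessarily Analytics' exact computation; the data here is simulated):

```python
# Sketch of the idea: bin a continuous x into 10 deciles and test
# whether the 'ones' are spread uniformly across them.
# Simulated data; not Analytics' exact computation.
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = (rng.random(1000) < 1 / (1 + np.exp(-x))).astype(int)  # y depends on x

edges = np.quantile(x, np.arange(0.1, 1.0, 0.1))           # 9 internal cut points
deciles = np.digitize(x, edges)                            # decile index 0..9
ones_per_decile = np.bincount(deciles, weights=y, minlength=10)
stat, p = chisquare(ones_per_decile)   # low p => ones not uniform => x related to y
```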

Z-Test
A Z-test is used for any binary or categorical predictors, regardless of the type of y-variable they’re trying to predict. It tests whether any category is significantly different in terms of the Y (relative to all other categories).
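A one-category-versus-rest proportion z-test of this kind can be sketched as follows (a minimal illustration with invented counts, not necessarily Analytics' exact computation):

```python
# Sketch of a one-category-vs-rest proportion z-test.
# Counts are invented; not Analytics' exact computation.
from math import sqrt

def z_test_category(ones_in, n_in, ones_out, n_out):
    """Z statistic comparing the rate of 'ones' inside a category vs. outside it."""
    p_in, p_out = ones_in / n_in, ones_out / n_out
    p_pool = (ones_in + ones_out) / (n_in + n_out)   # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_in + 1 / n_out))
    return (p_in - p_out) / se

z = z_test_category(60, 100, 40, 100)   # category rate 0.60 vs. 0.40 elsewhere
```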

F-Test
An F-test is used whenever you have a continuous x-variable predicting a continuous y-variable. In our Automated Mining tab, the data is sorted into deciles and an ANOVA test is run on these deciles to determine if the means are statistically different. This is more robust than a linear correlation, as it captures non-linear relationships and those that do not fit a standard curve.
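The decile ANOVA can also be sketched with scipy (simulated data; a minimal illustration, not necessarily Analytics' exact computation):

```python
# Sketch: one-way ANOVA across deciles of x to see whether mean y differs.
# Simulated data; not Analytics' exact computation.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)       # y depends on x

edges = np.quantile(x, np.arange(0.1, 1.0, 0.1))
deciles = np.digitize(x, edges)
groups = [y[deciles == d] for d in range(10)]
stat, p = f_oneway(*groups)             # low p => decile means differ => x related to y
```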

Once each of these tests is performed at the specified level of significance, we have a narrowed-down dataset of only variables that are statistically related to our y-variable, which brings us one step closer to figuring out which variables historically have the most influence on our y-variable and will end up in our final predictive model. 

-Caitlin Garrett, Statistical Analyst at Rapid Insight