Tuesday, November 27, 2012

Customer Tips From... Jeff Fleischer (Rapid Insight Inc.)

Okay, okay... Those of you who have worked with Jeff know that he isn't really a customer. But, as the Director of Client Operations here at Rapid Insight and an analyst at heart, he is a wealth of information, so I've decided to share some of his tips. They are:

1. The Format Column node was retired a few releases back in favor of the Convert node. Using Convert, change the data type you wish to format to "text", then select the desired style from the node's Format column. PS: The Format Column node is still accessible - try going through the menu Node -> Add Report -> Format Data. 

2. Merge nodes will forget their setup if you disconnect them from their inputs. Placing a Cache node just before the Merge will keep it from forgetting its configuration if you need to relocate or copy it. 

3. Merge nodes also act as a rename. Just edit the text in the Output Column and use the black arrows on the menu bar to rearrange the column order. Bonus tip: select multiple columns before using the black arrows to re-position several at a time. 

4. Use the menu option Edit -> Convert All Columns to Text to do just that. 

5. Dropdown box controls often respond to single-letter entries, so you can type a letter instead of picking from the dropdown list.

6. Select multiple fields using Ctrl-LMouse to Cleanse multiple fields by setting up a single rule/operation. Note that the fields all have to be of the same type (text, integer, date, etc.) for this to work. 

7. Use the new "File Created Column" option in a Combine Inputs node to identify (with a Filter or Dedup) the most recent records coming from a location. 
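For readers who think in code, the logic behind tip 7 (keep only the newest record per key, based on when its source file was created) can be sketched in a few lines of Python. This is an illustration of the idea, not how Veera implements it; the field names `student_id`, `status`, and `file_created` are made up for the example.

```python
from datetime import datetime


def latest_records(rows, key_field, created_field):
    """Keep, for each key, the row whose file-created timestamp is most recent."""
    latest = {}
    for row in rows:
        key = row[key_field]
        # Replace the stored row only when this one comes from a newer file.
        if key not in latest or row[created_field] > latest[key][created_field]:
            latest[key] = row
    return list(latest.values())


# Hypothetical rows, as if combined from several input files.
rows = [
    {"student_id": "A1", "status": "applied", "file_created": datetime(2012, 11, 1)},
    {"student_id": "A1", "status": "admitted", "file_created": datetime(2012, 11, 20)},
    {"student_id": "B2", "status": "applied", "file_created": datetime(2012, 11, 5)},
]

recent = latest_records(rows, "student_id", "file_created")
for row in recent:
    print(row["student_id"], row["status"])
```

In Veera itself, the same result would come from sorting on the file-created column and deduplicating on the key.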

Tuesday, November 20, 2012

How to Score a Dataset Using Analytics Only

Since we’ve already covered how to score a dataset using Veera, it’s only fair that we show you how to score using the Analytics Scoring program. We’ll start at the point where you save your scoring model within Analytics. After memorizing your model in the Model tab, you’ll want to move down to the Compare Models tab. This tab allows you to compare any two models side-by-side. Once you’ve decided which model you like better, you’re ready to save it by selecting the model and clicking the “Save Scoring Model As” button, as shown below.


Analytics will prompt you to navigate to where you’d like the file to be saved, and will save it with a .rism (Rapid Insight Scoring Model) extension. After saving the .rism file, you’ll want to open the Analytics Scoring Module by going to your Start Menu and navigating to Rapid Insight Inc. -> Analytics -> Scoring, as shown below. 


Once inside the scoring module, you’ll need to click the “Select Dataset” button and navigate to where the dataset you’d like to score is located on your machine. After loading in your dataset, you’ll see all of the variables within it populate the ‘Dataset Variables’ window. Next, you’ll need to click the “Select Scoring Model” button and navigate to where the scoring model (.rism) file you’d like to use is located. Once you find the model, its equation will show up in the corresponding window.


Before you start the scoring process, you have a couple of options detailing how you’d like the model to be scored. The first option, shown above in the green box, allows you to validate the model by looking at the decile analysis resulting from the scoring process. The second option, shown in the blue box, allows you to output the scores as well as the corresponding deciles or percentiles. After you’ve selected the appropriate options, click on the “Start Scoring” button, decide where you’d like your scores to output, and Analytics will score your dataset in the way that you request. 
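If you're curious what "corresponding deciles" means in practice, here is a minimal Python sketch of one common way to assign them. It assumes decile 1 holds the highest scores; it is not the Analytics Scoring module's actual code, and the function name and sample scores are invented for illustration.

```python
def assign_deciles(scores):
    """Return a decile (1 = highest scores, 10 = lowest) for each score, in input order."""
    n = len(scores)
    # Indices of the scores, ordered from highest score to lowest.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    deciles = [0] * n
    for rank, i in enumerate(order):
        # The top tenth of the ranked records gets decile 1, the next tenth 2, and so on.
        deciles[i] = rank * 10 // n + 1
    return deciles


# Hypothetical model scores for ten records.
scores = [0.91, 0.15, 0.62, 0.48, 0.77, 0.05, 0.33, 0.84, 0.26, 0.55]
deciles = assign_deciles(scores)
for score, decile in zip(scores, deciles):
    print(score, decile)
```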

-Caitlin Garrett, Statistical Analyst at Rapid Insight

Tuesday, November 13, 2012

Predictive Modeling Mantras

Whether you're new to predictive modeling, or you dream in decile analyses, here are some things to keep in mind as you're embarking on your next modeling project: 


Data preparation makes ALL the difference.
Simply put, if you use junk data to create a model, chances are that your model’s output will be junk too. Thus, it’s very important to clean up your data before building your predictive model. During the data clean-up process, you’ll want to think about how to handle missing values, which new variables could be added to your dataset, and how to handle outliers, and you’ll want to make sure that your data is as error-free as possible.

A complex model isn’t the same as a good model.
More often than not, the best model is a simple one. Although you can almost always find a new variable to add, or new way to slice your data, you want to avoid the trap of overfitting. You want your model to be specific, but not so specific that you sacrifice reliability when scoring a new dataset.

A good model validates what you know while revealing what you don’t.
Don’t be surprised if some of your “common sense” variables outperform the more exotic ones. Although it’s always nice to pick up on some new variables and insights, building a predictive model can also boost your confidence in the rest of your data.  

If a model looks perfect, it’s lying.
As exciting as a great model fit statistic can be, a model can be too good to be true. If you build a particularly great model, you’ll want to double and triple check each of the variables in the model to be sure they make sense. One of the most common reasons for a suspiciously great model is an anachronistic variable: a variable you would have available only after your y-outcome was decided.

Persistence is a virtue (because building models is an iterative process).
After you’ve taken a first pass at a model, maybe you’ll think of a related variable that would be predictive. Maybe you take a second look at some of the relationships between variables and decide to bin or re-map some of your continuous or categorical variables. Maybe the outputted variables are the opposite of what you expected, so you decide to tweak the way your dataset is set up. The point here is that your first model will likely not be your final model. Be ready.

Trust and verify.
The modeling process doesn’t end after you finish building your model. After implementing your predictive model, you want to be sure that it’s correctly predicting your y-variable over time. To do this, you’ll need to compare your model scores with actual results once they are available. If your model is correctly predicting the desired outcome, you can continue to use it (but still must validate as time goes by); otherwise, you’ll need to take a few steps back to see where you can make improvements.
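One simple way to make that comparison concrete is to group scored records from highest to lowest score and check whether the actual outcome rate in each group tracks the average predicted score. The Python below is only a sketch of that idea under our own assumptions (equal-sized groups, a 0/1 outcome, made-up numbers); it is not taken from any Rapid Insight product.

```python
def calibration_by_group(scores, actuals, n_groups=10):
    """Compare mean predicted score with the actual outcome rate in each score group.

    Group 1 holds the highest scores. Returns (group, mean_score, actual_rate) tuples.
    """
    # Rank records from highest score to lowest, keeping each actual outcome attached.
    paired = sorted(zip(scores, actuals), key=lambda p: p[0], reverse=True)
    n = len(paired)
    groups = {}
    for rank, (score, actual) in enumerate(paired):
        groups.setdefault(rank * n_groups // n + 1, []).append((score, actual))
    report = []
    for group in sorted(groups):
        members = groups[group]
        mean_score = sum(s for s, _ in members) / len(members)
        actual_rate = sum(a for _, a in members) / len(members)
        report.append((group, mean_score, actual_rate))
    return report


# Hypothetical scores and the outcomes observed later (1 = outcome happened).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
actuals = [1, 1, 1, 0, 1, 0, 0, 0]
report = calibration_by_group(scores, actuals, n_groups=2)
for group, mean_score, actual_rate in report:
    print(group, round(mean_score, 2), round(actual_rate, 2))
```

If the actual rates drift away from the predicted scores as time goes by, that is the signal to revisit the model.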


-Caitlin Garrett, Statistical Analyst at Rapid Insight

Wednesday, November 7, 2012

Subroutines (Customer Post by Tony Parandi)


Today's blog entry comes from Tony Parandi, Assistant Director of Institutional Research at Indiana Wesleyan University: 

One feature of Veera that I’ve found very helpful is the Subroutine node. This node allows you to feed the output of another job directly into the job you’re currently working on. In essence, it allows you to put a job within a job. This is especially helpful if you have a certain data stream that you commonly use, and do not want to rebuild it each time you need it.

One task I commonly use the Subroutine node for is recoding student ethnicity. In 2010, when the Department of Education mandated the new ethnicity categories, we added two additional ethnicity/race fields to our data system (Datatel). As a result, students can have a myriad of ethnic category combinations, which we have to recode to match the IPEDS definitions. To accomplish this, I bring in the ethnic data from our warehouse and use a series of transform nodes to convert the three ethnicity fields into one final ethnic category:

Rather than recreate this stream for every job, I have it saved as a separate job called “Ethnic Conversion”. As you can see in the picture above, the Output Proxy node is necessary when creating a job for Subroutine purposes, as it passes the job’s output data through the Subroutine node to the other job you’re building.

Now when I create a new job that needs ethnicity reformatting, I simply bring in a Subroutine node and connect to my data via Student ID in the Merge node.

The output gives me a single ethnic category that matches IPEDS for every student, based on data brought in via the Subroutine.

Although this is a small and simple example of a Subroutine job, the node is a powerful way to connect jobs without having to do copy/pasting or rebuilding. I have found the Subroutine node to be a great time saver, and I encourage everyone to use it whenever possible.
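Editor's note: for readers more used to code than to node diagrams, the pattern in this post (a reusable conversion step whose output is merged into a new job by Student ID) might look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the field names `eth1` to `eth3`, the category labels, and the simplified recoding rule (Hispanic/Latino takes precedence, more than one race becomes "Two or more races"). It does not reproduce Tony's actual transforms or the full IPEDS rules.

```python
UNKNOWN = "Race and ethnicity unknown"


def ipeds_category(eth1, eth2, eth3):
    """Collapse three ethnicity fields into one category (simplified, assumed rule)."""
    values = {v for v in (eth1, eth2, eth3) if v}
    if "Hispanic/Latino" in values:
        return "Hispanic/Latino"
    if len(values) > 1:
        return "Two or more races"
    if len(values) == 1:
        return next(iter(values))
    return UNKNOWN


def ethnic_conversion(ethnic_rows):
    """Plays the role of the saved job: returns {student_id: category}."""
    return {
        row["student_id"]: ipeds_category(row["eth1"], row["eth2"], row["eth3"])
        for row in ethnic_rows
    }


def merge_category(students, categories):
    """Plays the role of the Merge node: join on student_id."""
    return [
        dict(s, ethnic_category=categories.get(s["student_id"], UNKNOWN))
        for s in students
    ]


# Hypothetical warehouse rows and a hypothetical job's student data.
ethnic_rows = [
    {"student_id": 1, "eth1": "White", "eth2": "", "eth3": ""},
    {"student_id": 2, "eth1": "Asian", "eth2": "Hispanic/Latino", "eth3": ""},
    {"student_id": 3, "eth1": "White", "eth2": "Asian", "eth3": ""},
    {"student_id": 4, "eth1": "", "eth2": "", "eth3": ""},
]
students = [{"student_id": i, "gpa": 3.0} for i in (1, 2, 3, 4)]

merged = merge_category(students, ethnic_conversion(ethnic_rows))
for row in merged:
    print(row["student_id"], row["ethnic_category"])
```

The point of the structure is the same as with the Subroutine node: the conversion lives in one place, and every job that needs it calls it instead of rebuilding it.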

PS: Be sure to check out the rest of our customer tips series here!

Friday, November 2, 2012

NEAIR Presentation: "Four Years of Predictive Modeling and Lessons Learned"

Be sure to catch Dr. Michael Johnson, Director of Institutional Research at Dickinson College, presenting "Four Years of Predictive Modeling and Lessons Learned" at this year's NEAIR Conference.


Dr. Johnson will present an overview of his predictive modeling journey over the past four years. Sharing the many lessons learned, he will outline the various ways predictive modeling has become integrated into the college’s data-driven decision making and review how the Rapid Insight products and analytic expertise have played an integral role in that process.

The presentation will take place on Monday, November 5th from 2:30 - 3:15 in the Embassy Room.