Sunday, December 19, 2010

Veera: Simple Update Notification


Data whirring around is always good. It means there is progress, things are happening, and there is more to be done. Of course, if you have a huge database it doesn’t make much sense to identify the progress by combing through thousands of rows and who knows how many columns to find values that have been added or changed. Instead, wouldn’t it be nice if there were a simple job to spot the changes, large or small, and present them in a simple list? This post will walk you through the construction of a simple Update Notification job, whose sole purpose is to summarize changes in a table.
Setting up the Job
Begin by selecting your input node, a table whose regularly shifting contents you’d like to follow. Perhaps it is where the results of the day’s oxygen-absorption experiments are kept, or maybe it is a running list of email inquiries. Next create an output node, which we’ll call the Cache. This lil’ file is the key to the update system, as it will keep a record of what the input looked like the last time the job was run.
Cute! Run the job to create the output file, then add the new Cache file as a data connection. With your connection defined, drag the Cache input into your job. “But wait, now there are TWO instances of this file!” Yes, it is a little scary, but if you make their names distinct by adding something as simple as “input” and “output,” respectively, everything should be fine.
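Veera handles all of this visually, but if it helps to see the plumbing spelled out, here is a rough Python sketch (using the pandas library) of the same idea. The file names “profile.csv” and “cache.csv” are just placeholders standing in for the input node and the Cache connection.

```python
# A conceptual stand-in for the job's two inputs -- not actual Veera code.
import os
import pandas as pd

profile = pd.read_csv("profile.csv")       # the live table (new values)

if os.path.exists("cache.csv"):
    cache = pd.read_csv("cache.csv")       # the snapshot from the last run
else:
    # First run: no snapshot exists yet, so every row will look "new".
    cache = profile.iloc[0:0].copy()
```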
What we need to do now is compare the old values stored in the Cache with the new values from your primary input table, in this case called “Profile.” Time to add some machinery to this job. To make the comparison, we first have to bring the old and new values into the same thread, which we’ll do by creating a Merge node and connecting both inputs. When we join the two sources, the Merge node will add extensions to duplicate column headings to keep them distinct (“FolderComplete” and “FolderComplete#2”, for example). This is exactly what we want, but why not make life easier? Place a Rename node between the Cache input and the Merge, and for every column you want to compare, add an intuitive extension such as “_lastRun”: “FolderComplete_lastRun,” “Applied_lastRun,” “Hemoglobin_lastRun,” etc. Veera makes it easy to make life easier.
Open up the Merge node and join the two sources at the identifier key, the value that uniquely identifies each record; in this example it is called “Person_ID.” Right-click on the join and select “All records from [primary source] and matching records from [Cache].” This assumes rows are always being added, never removed. Select all columns from the primary source, plus the renamed columns from the Cache you’d like to compare; it isn’t necessary to bring them all over.
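For the code-minded, the Rename and Merge steps amount to something like this, continuing the pandas sketch above (“Person_ID” and “Ranking” are the example names from this post):

```python
compare_cols = ["Ranking"]                  # the columns you want to watch

# The Rename node: tag the cached copies with an intuitive extension.
cache_renamed = cache[["Person_ID"] + compare_cols].rename(
    columns={c: c + "_lastRun" for c in compare_cols}
)

# The Merge node: all records from the primary source, matching records
# from the Cache -- i.e., a left join on the identifier key.
merged = profile.merge(cache_renamed, on="Person_ID", how="left")
```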
Magnificent. Save and close the node. Now we are ready for the compare itself. Create a Transform node following the Merge. Open up the node and select the “Multi-Variable Formula” tab. From the list at the left, select the values you want to compare, in this case “Ranking” and “Ranking_lastRun.” Give your new variable a name and, using the IF function, flag the records where the old and new values are different (“<>” is the symbol for “does not equal”). In this example, the function leaves the new variable empty if the values are the same, and outputs the text string ‘updated’ when they are different.
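Spelled out in the same pandas sketch, the Multi-Variable Formula is just a vectorized IF:

```python
import numpy as np

# Flag rows whose value changed since the last run; leave the flag empty
# otherwise. Note: rows with no cached match compare as "different" here,
# so brand-new records will also be flagged -- often exactly what you want.
merged["update_status"] = np.where(
    merged["Ranking"] != merged["Ranking_lastRun"], "updated", ""
)
```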
Click the Update button to save your newly created variable, and you’ll be golden. Use a Filter node to grab all the entries where the new variable, ‘update_status’, equals ‘updated’, and pipe those fresh records into an output node. The system is almost finished; there is only one more thing to do. Click the Run Order button near the top of the layout window. Veera will then display only your output nodes, each with a number representing the order in which it will run. Our job has two output nodes, and the order matters: we don’t want to replace the old values in the Cache until AFTER we have compared them against the new ones. Since the update list is only written after the Transform compare has taken place, having that output run before the Cache output keeps us safe. Click the output nodes in the order in which you want them to run (so click the Update output, then the Cache output). Done!
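In the sketch, the run order is simply the order of the two writes: the update list first, the Cache refresh second.

```python
# Output 1: the change list (must run first, before the Cache is overwritten).
updates = merged[merged["update_status"] == "updated"]
updates.to_csv("updates.csv", index=False)

# Output 2: refresh the Cache with today's values, ready for the next run.
profile.to_csv("cache.csv", index=False)
```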
Click the Run Order button again to return to the regular display, and save your job. Now you have a totally functional Simple Update notifier. Very cool.
Other Tricks
There are a few other useful things you can do with this setup, too. If you wanted to know which rows have been added since the last time the job was run, assuming that the column you’re evaluating is populated for every row, you could check for rows that don’t have a value merged in from the Cache. By changing the join in the Merge node to “all records from both sources,” you could figure out which rows had been removed as well, by scanning for null values in the new data. How handy!
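Borrowing the same placeholder names from the sketch above, both tricks fall out of a full outer join:

```python
# An outer join keeps rows that exist on only one side, and the indicator
# column records which side each row came from.
both = profile.merge(cache_renamed, on="Person_ID", how="outer", indicator=True)

added   = both[both["_merge"] == "left_only"]    # no value merged in from the Cache
removed = both[both["_merge"] == "right_only"]   # gone from the new data
```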

By isolating changes with a setup like this you can do all kinds of cool things beyond merely staying aware of them. For example, if you need to manually keep two data sources synchronized, you can simplify the task a great deal by limiting the rows you move to only those that are new or have changed. I save myself a good hour each week doing just that! Variations on this theme are peppered throughout my jobs, useful in a great variety of ways. As one example, every week in my admissions office I provide a complete list of every candidate within each individual counselor’s territory. Some of these lists are quite lengthy, so I use a somewhat sophisticated Update Notifier to point out important changes in a candidate’s application status and communication stream, allowing the counselors to focus their efforts in very deliberate, helpful ways. Beyond making my work far more productive, Veera has done the same for a great number of folks who may never even use the software directly.

-Ryan Moran 12/19/10

Tuesday, December 14, 2010

Welcome to the Official Rapid Insight Blog


Welcome to the official blog of Rapid Insight, Inc., developers of Veera and Analytics, two elegant software solutions for data manipulation and predictive modeling. Here you’ll find all manner of collected goodies, including company announcements and relevant news. We’ll post speed trials of complex feats, technical tutorials for all skill levels, and from time to time you may also see a special guest post, detailing a customer success story.
Accomplishing sophisticated feats of data transformation and analysis has never been more intuitive. Forsake your weak spreadsheet, and make yourself comfortable on a brand new plane of information.

-Ryan Moran 12/14/10

Veera: Maintain Merge Nodes with Dummy Nodes


Imagine you are working on a job with a Merge node, and on first build you have a very selective and specific list of joins between two sources. All good! Now suppose later on, for whatever reason, you have to make edits to your sources: the schema changes in your CSV file, you want to replace one year’s data with another’s, etc. Veera, in its genius, will re-scan the source to account for any column changes, and accordingly your joins will be reset. This is exactly what you want in most situations, but what if you want to keep your joins the way they were before?
Be not broken-hearted; here is a little trick: between the Merge node and each source, insert a Filter node. There is no need to do anything with the Filter settings; just putting it there is enough.
Because the Filter node will not change when the sources do, the joins in your Merge node will be maintained. Think of it as a bushing for your incoming data, or a simple interconnect for a swappable component. Go ahead and switch your input nodes or modify your column list: so long as the columns you wish to include in the output of the Merge are preserved, your joins will be maintained as well. Brilliant, right?

-Ryan Moran 12/14/10

Webinar: Dr. Mike Johnson on Retention Analysis


Back in early November, Dr. Mike Johnson, Director of Institutional Research at Dickinson College, gave a stellar web presentation on how he used RI Analytics to tackle a notoriously challenging issue in higher ed: retention modeling. Who is leaving our college? Why? How do we pinpoint those likely to leave, and possibly keep them around? How do we adjust our recruitment and broader institutional practices to address what we’ve learned?
Dr. Johnson is something of a legend in the Rapid Insight world, having once used Analytics to predict the size of his small institution’s incoming class within six candidates. His presentation does not disappoint. He explains the reasoning and methodology behind the Dickinson retention scenario, and how Analytics was uniquely helpful in cracking yet another tough research problem. Watch the entire webinar here.
The presentation is exceptionally thorough, detailing the subtle philosophy involved in framing the issue, the available datapoints and how they may or may not be useful, and the impressive side effects of a problem well solved (such as predicting first-year GPA in addition to retention/attrition likelihood). Dr. Johnson also does a great job of walking through the Analytics software and explaining the modeling process.
If you are involved in predictive modeling or quantitative research in higher ed, or are simply curious about how this stuff works, the video is definitely worth a watch.

-Ryan Moran 12/14/2010