Rapid Insight: Data Analytics: January 2011

Sunday, January 16, 2011

Veera: Email Address Corrector

Veera is an ideal solution for data processing, integration, and analysis, but it can also be a powerful platform for miscellaneous error-correction and markup utilities. It is possible to replace an entire fleet of data-scrubbing interns, who labor over lists for a week looking for small data entry errors (and are subject to errors themselves), with one flawless Veera job that runs in less than fifteen minutes.

In this example, I’ll detail an email address corrector I made for Bennington College. Even when purchasing prospective contacts from high-profile providers, the email addresses are only as good as the original entry and transfer. Because my coworkers and I found a high incidence of certain kinds of misspellings, I created a Veera job to find the erring addresses and replace them with viable ones. Some addresses can’t be salvaged, and the job makes it easy to isolate, inspect, and if necessary discard those entries as well. This saves time and money, because we no longer send materials to people who are certain not to receive them, and it reduces our email bounce-back rate at the same time.

The job consists of an input source which can very easily be swapped, a series of sequential transform operations (all performed on the ‘Email’ column), and a clean output list, which can also be changed quite easily. Here’s how it looks:

Prior to running this utility, in another job I merge the purchased list with a list of inquiries from our local database. The first transform node in this job, “Remove Admiss Flags,” scrubs some remaining markers that our local DB inserts into the email string of certain candidates. That done, we get into the real work.

The two real challenges of this job are that email addresses can’t be modified in place, and that the “handle” (the part before the ‘@’, as opposed to the “@hotmail.com” part of the address, which I’ve been referring to as the “service”) is highly variable. To work around both of these troubles, I split the address into two parts, using the ‘@’ character as the delimiter. If an address does not have an ‘@’ character it isn’t usable, so that row is flagged as erroneous.
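For readers who don't have Veera in front of them, here is a rough Python sketch of the split-and-flag step. It is not the actual Veera transform; the function name `split_email` and the choice to split on the first ‘@’ are my assumptions for illustration.

```python
from typing import Optional, Tuple


def split_email(address: str) -> Tuple[Optional[str], Optional[str], bool]:
    """Split an address into (handle, service, is_error).

    Hypothetical helper, not a Veera function. An address with no '@'
    is flagged as erroneous, as described in the post.
    """
    address = address.strip()
    if "@" not in address:
        # No delimiter: the row cannot be repaired, so flag it.
        return None, None, True
    # Assumption: split on the first '@' only.
    handle, _, service = address.partition("@")
    return handle, service, False


print(split_email("jane@hotnail.com"))
print(split_email("janehotmail.com"))
```

Flagged rows can then be routed to a separate list for inspection instead of being discarded outright.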

With the email separated into two parts, I can send the service string into the garage for all kinds of typographic repairs. The six most common errors are: hotnail.com (hotmail), gnail.com (gmail), yaho.com (yahoo), .ed (edu), .co (com), and .or (org). If you have seen some others, let us know! More brains are always better for challenges like these.
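The six repairs listed above could be sketched in Python roughly like this. This is only an illustration, not the Veera transform functions themselves: the names `SUFFIX_FIXES`, `DOMAIN_FIXES`, and `fix_service` are made up, and lowercasing the service and repairing the ending before the domain name are my assumptions. Note that “.co” is a real country domain, so treating it as a typo for “.com” only makes sense for a list like ours.

```python
# Truncated endings and their intended forms (from the list in the post).
SUFFIX_FIXES = {".ed": ".edu", ".co": ".com", ".or": ".org"}

# Misspelled services and their intended forms (from the list in the post).
DOMAIN_FIXES = {
    "hotnail.com": "hotmail.com",
    "gnail.com": "gmail.com",
    "yaho.com": "yahoo.com",
}


def fix_service(service: str) -> str:
    """Return the service string with the six common typos repaired.

    Hypothetical helper. The ending is repaired first so that a string
    like 'gnail.co' can then match the domain table.
    """
    fixed = service.strip().lower()  # assumption: normalize case
    for bad, good in SUFFIX_FIXES.items():
        # endswith matches the whole ending, so '.com' is not touched by '.co'.
        if fixed.endswith(bad):
            fixed = fixed[: -len(bad)] + good
            break
    return DOMAIN_FIXES.get(fixed, fixed)


for sample in ["hotnail.com", "bennington.ed", "gnail.co", "yahoo.com"]:
    print(sample, "->", fix_service(sample))
```

Keeping the typos in lookup tables means a newly spotted misspelling is a one-line addition.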

When the cleaned-up service string is ready, it is recombined with the user handle using the ‘@’, resulting in a perfectly functional email address. A rename/exclude node is used to remove all of the secondary columns created in processing the data, and then the list is output and ready to be uploaded to the email server.
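As a rough Python equivalent of the recombine and rename/exclude steps, assuming each row is a dictionary and the working columns are called `Handle` and `Service` (hypothetical names; only the ‘Email’ column comes from the job itself):

```python
from typing import Dict, List

# Hypothetical names for the secondary columns created during processing.
HELPER_COLUMNS = ("Handle", "Service")


def recombine(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Rebuild 'Email' from handle and service, then drop the helper columns."""
    cleaned = []
    for row in rows:
        # Keep every column except the working ones.
        out = {key: value for key, value in row.items() if key not in HELPER_COLUMNS}
        out["Email"] = f'{row["Handle"]}@{row["Service"]}'
        cleaned.append(out)
    return cleaned


rows = [
    {"Name": "Jane", "Email": "jane@hotnail.com", "Handle": "jane", "Service": "hotmail.com"},
]
print(recombine(rows))
```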

If you are curious and would like to see the actual transform functions, have an exported version of the job emailed to you, or have suggestions for similar utilities, please comment!

-Ryan Moran 1/16/11