Friday, August 30, 2013

Here's to the Skeptics: Addressing Predictive Modeling Misconceptions

Photo credit: Jonny Goldstein
As a full-time analytics professional, I have a hard time conceiving of people who have not fully embraced the power of predictive analytics, but I know they’re out there and I think it’s important to address their concerns. In doing so, I’m not here to argue that predictive analytics is a perfect fit for every organization. Predictive analytics requires investment: in your data, in infrastructure and technology, and of your time. It’s also an investment in your company, your internal knowledge base, and your future. I’m here to argue that the investment is worth it. 

To do so, I’ve presented a few clarifications to address predictive modeling concerns that I’ve heard from skeptics. If you have anything to add, or if there are any big concerns I’ve missed, let me know in the comments.

You don’t need to be a PhD statistician to build predictive models
A working knowledge of statistics will help you to better interpret the results of predictive models, but you don’t need ten years’ experience or a doctoral degree to glean insight or use the output from a model. There are software packages out there with diagnostics that can help you understand which variables are important, which are not, and why. Knowing your data is as important as statistical knowledge, and both will serve you well in the long run.

A predictive model shouldn’t be a black box
There are plenty of companies and consultants whose predictive models could fall into the “black box” category. The model building process, in this case, involves sending your data to an outside party who analyzes it and returns a series of scores. On the surface, this may not seem like a bad thing, but once you’ve built your first model, you’ll understand why this is not nearly as valuable as doing it yourself. While the output scores are important, you also want to know which variables were used and how the model handled any missing or outlying values, and you want to glean insight beyond a single set of scores so that you can change or monitor specific behaviors going forward.

Even if you know your data, modeling can help
A finished predictive model will do at least one of two things: confirm what you’ve always believed, or bring new insights to light. In our office, we refer to this idea as “turn or confirm”: a model will either turn or confirm the things you’ve thought to be true. Most of the time, models will do both. This allows you to validate any anecdotal evidence you might have (or realize that correlations might not be as strong as you thought) and to take a look at new variables or connections that you may not have picked up on before.

Predictive models can be implemented quickly
I've heard some horror stories about a model taking months, or even years, to implement. If this is the case at your institution, you're doing it wrong. At this point, predictive modeling software has become incredibly efficient, usually able to turn out models within seconds or minutes. The bulk of the time spent on a model typically goes to data clean-up, and how much is needed will vary from company to company. In any case, this is time well spent. Clean data is just as good for reporting, dashboarding, and visualizing as it is for predictive modeling.

Predictive models enhance human judgment rather than replace it
If models were meant to replace human judgment, I too would be uncomfortable with, and suspicious of, the idea. However, the aim of predictive modeling is almost always to enhance and expand human expertise, allowing us (the end users) to be better informed and more data-driven in our decision making.

-Caitlin Garrett, Statistical Analyst at Rapid Insight

Wednesday, August 14, 2013

Big Data and New Methods

Guest post by Chuck McClenon, Fundraising Scientist from University of Texas at Austin 

When I went to my first APRA Data Analytics Symposium in 2010, the use of analytics in support of philanthropic fundraising was a novelty.   “Analysis”, for most organizations, consisted of descriptive statistics in Excel.  A few pioneers had built regression models, and the Symposium faculty pretty much consisted of those who could explain the differences between Ordinary Linear and Logistic Regression. 

What a difference three years has made! At this year’s Symposium in Baltimore we considered keyword analysis, hierarchical linear modeling, visualization, and the use of financial industry formulae for portfolio optimization. We have progressed beyond regression and now have a critical mass of practitioners throwing ideas at each other. And at many of our institutions we are also accumulating the critical mass of data needed to support serious mining and to try these new approaches.

Alan Schwarz, formerly with ESPN and more recently the New York Times, gave the keynote address. Alan had written a series of articles for the Times, over several years, examining the incidence of concussions among NFL players, and their long-term effects, including early-onset dementia. One retired player with dementia at age 50 does not tell a story, and the pushback was that there wasn’t enough data, though most of the data was buried in medical records and team records. The demand for more data was a case of the “better” being the enemy of the “good”. This one didn’t really require Big Data, it just needed Enough Data. Early-onset dementia is normally extremely rare. When you have five cases in a population of only 2000+ retired NFL players, it’s hardly chance. Schwarz’s exposition is leading to real changes in how head injuries are being regarded in football, down to college, high school, and youth leagues. Tenacity with data, that’s what analytics is about.
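The “hardly chance” reasoning can be sketched as a binomial tail probability: how likely are five or more cases among 2,000 people if each person has only the ordinary background risk? The talk as reported here gives no base rate, so the 1-in-10,000 figure below is purely a placeholder for illustration and not an epidemiological estimate; the function name is also hypothetical.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability of k or more cases among n people, each with independent risk p."""
    # Sum the binomial probabilities for 0 .. k-1 cases, then take the complement.
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


# ASSUMPTION: placeholder base rate, NOT a real figure. Substitute a sourced
# rate for the relevant age group before drawing any conclusion.
hypothetical_base_rate = 1 / 10_000

tail = prob_at_least(5, 2000, hypothetical_base_rate)
print(f"Chance of 5+ cases by luck alone: {tail:.2e}")
```

Under that placeholder rate the expected number of cases is 0.2, so five cases would be far out in the tail. The conclusion depends entirely on the base rate you plug in, which is why a modest amount of well-chosen data can be enough.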

Divah Yap of the University of Minnesota offered an intriguing presentation on scoring the free text in contact reports for words or phrases that may indicate attitude toward the organization. We have a lot of usable data around us, if we know how to decompose it and connect the dots. When we have enough data, well-organized, we can understand it in ways we never could before.

Visualization may be coming of age as part of analysis. One of our fundraising projects here at UT, which we have mostly failed at so far, is to find donors for the Texas Advanced Computing Center (TACC) and its visualization lab. But if we can’t help them, maybe they can help us. In a few weeks we’re going to get together with them, and hand them the keys to our data warehouse, and see if they can paint it in colors we never imagined, and help us to see it in ways that the numbers alone don’t tell us.

In my college class on linear methods, we were warned strictly against correlation fishing. In your typical experiment in human psychology, p < .05 is the standard, and if you run your experiments on twenty or fifty or even a hundred subjects, getting past p < .05 can be a challenge. And of course the measurement “p = .05” means that, even when there is no real effect, there is still a one in twenty chance of seeing a result this strong and drawing the wrong conclusion. Run ten such studies on effects that aren’t real, and there’s a 40% likelihood that at least one of your conclusions, if not more, will be wrong.
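The 40% figure follows from treating the ten studies as independent tests, each with a one in twenty chance of a false positive. A minimal sketch of that arithmetic (the function name is just for illustration):

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests independent tests,
    assuming no real effect exists in any of them."""
    # Probability that every test correctly comes up empty is (1 - alpha)^n;
    # the complement is the chance that at least one goes wrong.
    return 1.0 - (1.0 - alpha) ** n_tests


rate = familywise_error_rate(alpha=0.05, n_tests=10)
print(f"{rate:.1%}")  # 40.1%
```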

Taken a different way, and this is where the dictum against correlation fishing comes in, if you had a file with ten independent variables and threw it into a correlation matrix, there would be 45 pairs of variables to correlate, and if you set your standard going in as p < .05, then from those 45 pairings you could expect to draw about two false conclusions (45 × .05 = 2.25). Try it on a file of twenty variables or more, with hundreds of combinations to test, and there is a real risk that the apparent correlations are simply the random noise in the sample, and are as much a reflection of tides and astrology as they are of anything causative within the population. And with more variables thrown into the mix, there is also the increasing risk of multicollinearity if your variables are in fact numerically related in their derivations.
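To make the correlation-fishing arithmetic concrete, here is a rough sketch that counts the pairings and then fishes through columns of pure random noise. Everything in it is illustrative: the function names are made up, the data is simulated rather than drawn from any donor file, and the p-value uses the Fisher z approximation instead of an exact test.

```python
import math
import random
from itertools import combinations
from statistics import NormalDist


def expected_false_positives(n_vars: int, alpha: float = 0.05):
    """Return (number of variable pairs, expected false positives at alpha)."""
    n_pairs = math.comb(n_vars, 2)
    return n_pairs, n_pairs * alpha


def pearson_r(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r: float, n: int) -> float:
    """Two-sided p-value for r via the Fisher z approximation (not exact)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_spurious(n_vars: int, n_rows: int, alpha: float = 0.05, seed: int = 0) -> int:
    """Count 'significant' pairs among columns that are unrelated by construction."""
    rng = random.Random(seed)
    cols = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_vars)]
    return sum(
        1
        for a, b in combinations(cols, 2)
        if approx_p_value(pearson_r(a, b), n_rows) < alpha
    )


pairs, expected = expected_false_positives(10)
print(pairs, f"{expected:.2f}")  # 45 2.25
print(count_spurious(n_vars=10, n_rows=50))  # varies with the seed
```

The simulated count will bounce around from seed to seed, but over many runs it averages out near the expected value, even though none of the columns has anything to do with any other.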

But when we study donor behavior in large organizations, we move beyond the realm of the psychology lab and limited sample sizes. The University of Texas at Austin has a constituent database of over 500,000 alumni and friends. I have decades of gift history, and I have acquired consumer behavior information, derived from point-of-sale and other sources. People with cats give more to the arts, people with dogs give more to athletics, but in the end their total giving is similar. I can say this “with confidence” when p < .0001. Big Data tells us stories, and illustrates them in color. This doesn’t mean that I can operationalize any strategy dependent on dogs and cats (especially never depend on cats), but it does give us new insights.

Coming back to the conference, if there are a half-dozen presenters offering totally novel approaches to analysis, then the probability is fairly high that any one of them may be a total waste of time, but there’s a pretty good chance that at least one or two of them contain real nuggets.  That’s the nature of data mining, and it’s also why we go to conferences, to look for new insights, which may or may not be usable.  Coming away from this year’s Symposium, many of us are feeling almost overwhelmed by new ideas, and just wishing we had the time needed to explore all of them. 

Big Data?  How Big is big enough, and how much is too big?  That’s becoming a difficult question, and the boundaries of privacy will be a philosophical argument for years to come.  I’ve reached the unscientific conclusion that market segmentations such as Claritas or PersonicX clusters are dead on the money 85% of the time, a little bit off 10% of the time, and absolutely wrong 5% of the time.  When there’s so much data around, and They seem to have such a complete picture of the individual, is it comforting to know that some of it is probably wrong, and so the picture that They have of us isn’t as accurate as we’re afraid?   When I talk about cat owners and dog owners, should you be shocked that I know so much about my constituents, or shocked that I draw conclusions from such imperfect data?  Perhaps both, but Big Data is becoming reality, and so we will learn to use it for what it is, to use it wisely and respectfully.

Organize, transform, restructure, build a systematic repository. Mine for connections. And if you don’t have a supercomputer for your visualization, Tableau may take you a long way.

About Chuck: Chuck McClenon arrived at the University of Texas at Austin in 1975 and earned a PhD in linguistics while dabbling in the nascent technology of pattern recognition. After a year teaching English in China, he returned to UT to work in administrative information management, searching for patterns and meaning in data ranging from student course registrations to library book titles to the bit-paths of room keys. He joined the advancement operation as an IT manager in 1996 at the start of UT’s first comprehensive capital campaign. After a brief tour of duty managing the gift processing and donor records operation, he retired to a cave and immersed himself in phonathon results and gift officer contact reports. Now he spends his days acquiring, constructing, managing and analyzing data representing the full spectrum of advancement activity. Since 2006, he has held the official title of Fundraising Scientist.