Wednesday, August 14, 2013

Big Data and New Methods

Guest post by Chuck McClenon, Fundraising Scientist from University of Texas at Austin 

When I went to my first APRA Data Analytics Symposium in 2010, the use of analytics in support of philanthropic fundraising was a novelty.   “Analysis”, for most organizations, consisted of descriptive statistics in Excel.  A few pioneers had built regression models, and the Symposium faculty pretty much consisted of those who could explain the differences between Ordinary Linear and Logistic Regression. 

What a difference three years has made!  At this year’s Symposium in Baltimore we considered keyword analysis, hierarchical linear modeling, visualization, and the use of financial industry formulae for portfolio optimization.  We have progressed beyond regression and now have a critical mass of practitioners throwing ideas at each other.  And at many of our institutions we are also accumulating the critical mass of data needed to support serious mining and to try these new approaches.

Alan Schwarz, formerly with ESPN and more recently the New York Times, gave the keynote address.  Alan had written a series of articles for the Times, over several years, examining the incidence of concussions among NFL players, and their long-term effects, including early-onset dementia.  One retired player with dementia at age 50 does not tell a story, and the pushback was that there wasn’t enough data, but most of the data is buried in medical records and team records.  The demand for more data was a case of the “better” being the enemy of the “good”.  This one didn’t really require Big Data, it just needed Enough Data.  Early-onset dementia is normally extremely rare.  When you have five cases, in a population of only 2000+ retired NFL players, it’s hardly chance.  Schwarz’s exposition is leading to real changes in how head injuries are being regarded in football, down to college, high school, and youth leagues.  Tenacity with data, that’s what analytics is about.

Divah Yap of the University of Minnesota offered an intriguing presentation on scoring the free text in contact reports for words or phrases which may tend to indicate attitude toward the organization.  We have a lot of usable data around us, if we know how to decompose it and connect dots.  When we have enough data, well-organized, we can understand it in ways we never could before.

Visualization may be coming of age as part of analysis. One of our fundraising projects here at UT, which we have mostly failed at so far, is to find donors for the Texas Advanced Computing Center (TACC) and its visualization lab.  But if we can’t help them, maybe they can help us.  In a few weeks we’re going to get together with them, and hand them the keys to our data warehouse, and see if they can paint it in colors we never imagined, and help us to see it in ways that the numbers alone don’t tell us.

In my college class on linear methods, we were warned strictly against correlation fishing.  In your typical experiment in human psychology, p < .05 is the standard, and if you run your experiments on twenty or fifty or even a hundred subjects, getting past p < .05 can be a challenge.  And of course the measurement “p = .05” means that, when no real effect exists, there is still a one in twenty chance of seeing a result that strong by luck alone.  Run ten such independent studies where nothing real is going on, and there’s a 40% likelihood that at least one of your conclusions, if not more, will be wrong.
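The 40% figure can be checked with a few lines of arithmetic. The following is only a minimal sketch, assuming the ten studies are independent and that no real effect exists in any of them; the function name `familywise_error_rate` is made up for this illustration.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n independent tests.

    Assumes every test is independent and that there is no real
    effect to find, so each test "succeeds" by luck with probability alpha.
    """
    # Probability that all n tests correctly come up empty is (1 - alpha)^n;
    # the complement is the chance that at least one is a false alarm.
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for n in (1, 10, 20):
        rate = familywise_error_rate(0.05, n)
        print(f"{n:>2} tests at p < .05: {rate:.0%} chance of at least one false positive")
```

With ten tests the result comes out at about 40%, and it keeps climbing as more tests are added.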

Taken a different way, and this is where the dictum against correlation fishing comes in: if you have a file with ten independent variables, and you throw it into a correlation matrix, there will be 45 pairs of variables to correlate, and if you set your standard going in as p < .05, then from those 45 pairings you can expect to draw about two false conclusions.  Try it on a file of twenty variables or more, with hundreds of combinations to test, and there is a real risk that the apparent correlations are simply the random noise in the sample, and are as much a reflection of tides and astrology as they are of anything causative within the population.  And with more variables thrown into the mix, there is also the increasing risk of multicollinearity if your variables are in fact numerically related in their derivations.
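The counting behind those numbers is easy to reproduce. This is a rough sketch rather than a recipe: it assumes the variables are truly unrelated to one another, so every "significant" correlation is a false alarm, and the function name `expected_false_positives` is invented for the example.

```python
from math import comb


def expected_false_positives(n_variables: int, alpha: float = 0.05) -> tuple[int, float]:
    """Return (number of variable pairs, expected spurious correlations).

    Assumes the variables are unrelated in the population, so each
    pairwise test has probability alpha of a false alarm.
    """
    n_pairs = comb(n_variables, 2)  # distinct pairs in a correlation matrix
    return n_pairs, n_pairs * alpha


if __name__ == "__main__":
    for k in (10, 20, 30):
        pairs, expected = expected_false_positives(k)
        print(f"{k} variables: {pairs} pairs, about {expected:.1f} spurious hits at p < .05")
```

Ten variables give 45 pairs and a little over two expected false alarms; twenty variables give 190 pairs, and the expected count of spurious hits grows in step.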

But when we study donor behavior in large organizations, we move beyond the realm of the psychology lab and limited sample sizes.  The University of Texas at Austin has a constituent database of over 500,000 alumni and friends.  I have decades of gift history, and I have acquired consumer behavior information, derived from point-of-sale and other sources.  People with cats give more to the arts, people with dogs give more to athletics, but in the end their total giving is similar.  I can say this “with confidence”, when p < .0001.  Big Data tells us stories, and illustrates them in color.  This doesn’t mean that I can operationalize any strategy dependent on dogs and cats (especially never depend on cats), but it does give us new insights.

Coming back to the conference, if there are a half-dozen presenters offering totally novel approaches to analysis, then the probability is fairly high that any one of them may be a total waste of time, but there’s a pretty good chance that at least one or two of them contain real nuggets.  That’s the nature of data mining, and it’s also why we go to conferences, to look for new insights, which may or may not be usable.  Coming away from this year’s Symposium, many of us are feeling almost overwhelmed by new ideas, and just wishing we had the time needed to explore all of them. 

Big Data?  How Big is big enough, and how much is too big?  That’s becoming a difficult question, and the boundaries of privacy will be a philosophical argument for years to come.  I’ve reached the unscientific conclusion that market segmentations such as Claritas or PersonicX clusters are dead on the money 85% of the time, a little bit off 10% of the time, and absolutely wrong 5% of the time.  When there’s so much data around, and They seem to have such a complete picture of the individual, is it comforting to know that some of it is probably wrong, and so the picture that They have of us isn’t as accurate as we’re afraid?   When I talk about cat owners and dog owners, should you be shocked that I know so much about my constituents, or shocked that I draw conclusions from such imperfect data?  Perhaps both, but Big Data is becoming reality, and so we will learn to use it for what it is, to use it wisely and respectfully.

Organize, transform, restructure, build a systematic repository.  Mine for connections.  And if you don’t have a supercomputer for your visualization, Tableau may take you a long way.

*
About Chuck:  Chuck McClenon arrived at the University of Texas at Austin in 1975, earned a PhD in linguistics, dabbling in the nascent technology of pattern recognition.  After a year teaching in English in China, he returned to UT to work in administrative information management, searching for patterns and meaning in data ranging from student course registrations to library book titles to the bit-paths of room keys.  He joined the advancement operation as an IT manager in 1996 at the start of UT’s first comprehensive capital campaign. After a brief tour of duty managing the gift processing and donor records operation, he retired to a cave and immersed himself in phonathon results and gift officer contact reports. Now he spends his days acquiring, constructing, managing and analyzing data representing the full spectrum of advancement activity.  Since 2006, he has held the official title of Fundraising Scientist.
