Topiary Discovery LLC


Saturday, April 14, 2012

Redefine Transparency in Predictive Coding: Shoot for Validity



There is a lot of buzz in the predictive coding world over transparency. (I’ve let go of the semantic debate about what PC is and is not; it’s like art, I know it when I see it.) Transparency seems to have become the "fortified by wholesome goodness" seal for predictive coding.

First of all, frankly, I find the use of the term a bit jarring, given what undergirds predictive coding. Machine learning and predictive clustering techniques are, from a practitioner's standpoint, highly opaque at their core. Most regimes employ some version of statistical text mining (LSA, pLSA, LDA, and other clustering techniques), followed by some form of supervised learning technique (like a support vector machine).

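To illustrate, here’s a snippet from Wikipedia describing latent semantic analysis, a basic text mining approach:

[Wikipedia excerpt on latent semantic analysis, missing from this copy]

Here's another concerning support vector machines:

[Wikipedia excerpt on support vector machines, missing from this copy]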

Transparency does not leap to mind. Even someone with postdoctoral work in machine learning would be hard-pressed to deconstruct the math that accurately explains why one set of documents received a given predicted relevance and confidence score while another set received a different one. I suppose they could do it, given time, but the resulting explanation would make most of our noses bleed. In fact, in my experience, pressing for specifics on predictive coding yields a repetition of marketing jargon that is sometimes reminiscent of the discussion of Brawndo and electrolytes.
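
To make the opacity concrete, here is a rough sketch of the math behind one textbook pairing: LSA followed by a linear support vector machine. It is a simplified, assumed formulation, not any vendor's actual method, and the symbols (X, U_k, Σ_k, V_k, d, w, b, α_i, y_i) are standard notation introduced here for illustration.

```latex
% Truncated SVD of the term-document matrix X (terms x documents), keeping k dimensions
X \approx U_k \Sigma_k V_k^{\top}

% Fold a document's term-weight vector d into the k-dimensional latent space
\hat{d} = \Sigma_k^{-1} U_k^{\top} d

% Linear SVM score: the weight vector is a combination of the support vectors \hat{d}_i
f(d) = w^{\top} \hat{d} + b, \qquad w = \sum_{i} \alpha_i y_i \hat{d}_i
```

In this setup a document is predicted relevant when f(d) > 0, and the size of f(d) often serves as the basis for a confidence score. Explaining a single score means tracing every term weight in d through U_k and Σ_k, and then through each support vector that contributes to w.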

So predictive coding doesn't conjure up the term "transparency" for me.

The transparency campaign that is being so furiously waged by the industry to assuage fears really has to do with the transparency of the testing done around the predictions.  And that's a good focal point.  But this kind of "transparency" only has value to the extent that it provides the traditional testing assurances of both reliability and validity.   

Right now, I’d argue that proposed testing processes are assuring reliability: the consistency of the testing process. However, I'd also argue that considerations about validity, meaning the degree to which the testing supports the conclusions drawn from it, have been given short shrift.
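
The gap between the two can be shown with a toy simulation. This is only a sketch under invented assumptions: the document mix, the "obvious" and "subtle" categories, and every rate below are hypothetical and describe no real matter or product. It assumes the reviewers whose coding serves as the benchmark miss the subtle relevant documents, so repeated sample tests agree with each other (reliable) while overstating how much truly relevant material is found (not valid).

```python
import random

def recall(sample, predicted, benchmark):
    """Share of benchmark-relevant documents in `sample` that the model flagged."""
    relevant = [d for d in sample if benchmark[d]]
    if not relevant:
        return 0.0
    return sum(predicted[d] for d in relevant) / len(relevant)

def simulate(seed=0, n_docs=50_000, n_trials=5, sample_size=5_000):
    rng = random.Random(seed)
    truth, reviewer, predicted = [], [], []
    # Hypothetical rates, chosen only for illustration.
    flag_rate = {"obvious": 0.90, "subtle": 0.10, "none": 0.02}
    for _ in range(n_docs):
        r = rng.random()
        kind = "obvious" if r < 0.05 else "subtle" if r < 0.10 else "none"
        truth.append(kind != "none")        # what is actually relevant
        reviewer.append(kind == "obvious")  # what the reviewers coded as relevant
        predicted.append(rng.random() < flag_rate[kind])  # what the model flagged
    vs_reviewer, vs_truth = [], []
    for _ in range(n_trials):
        # Each trial is a fresh random validation sample of the same collection.
        sample = rng.sample(range(n_docs), sample_size)
        vs_reviewer.append(recall(sample, predicted, reviewer))
        vs_truth.append(recall(sample, predicted, truth))
    return vs_reviewer, vs_truth

if __name__ == "__main__":
    vs_reviewer, vs_truth = simulate()
    print("recall vs. reviewer coding:", [round(r, 2) for r in vs_reviewer])
    print("recall vs. true relevance: ", [round(r, 2) for r in vs_truth])
```

By construction, recall measured against the reviewers' coding should hover near 0.9 in every trial, while recall against true relevance should sit near 0.5. A fully disclosed testing protocol would report the first number faithfully every time, and it would still support the wrong conclusion.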

So it may be wise to think more specifically about validity issues when considering how transparent a proposed predictive coding process is.
