Topiary Discovery LLC

Powered by: Topiary Discovery LLC ... The Home of Predictive Pruning®



Saturday, October 18, 2014

William Webber Posts: A Bean Counter's View of Training Approaches in Predictive Coding

In two posts, one and two, text analytics expert and researcher William Webber discusses a model, and a revised model, for estimating the comparative costs of four different training methods employed in analytics-enabled ediscovery.

Worth the read.  The second post displays some spreadsheet views of per-unit review costs as prevalence varies.  More fundamentally, it promotes the notion that how you conduct your pre-culling (if that is done) and review may affect the type of tool you may wish to "rent" from a vendor, and that altering the predictive coding workflow can have significant cost implications.
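Webber's actual model lives in his spreadsheets; as a rough illustration of why per-unit cost falls as prevalence rises, here is a toy cost model. Every number in it, including the review-depth multiplier, is hypothetical and chosen only for illustration:

```python
# Toy per-document review cost model (an illustrative sketch, NOT
# Webber's actual model): total review effort divided by relevant
# documents found.

def cost_per_relevant(collection_size, prevalence, training_docs,
                      review_rate, target_recall=0.75):
    """Rough unit cost ($) of each relevant document retrieved."""
    relevant = collection_size * prevalence
    # Hypothetical assumption: hitting the recall target requires
    # reviewing three times the number of relevant docs retrieved.
    review_depth = relevant * target_recall * 3
    reviewed = training_docs + review_depth
    return reviewed * review_rate / (relevant * target_recall)

for p in (0.01, 0.05, 0.20):
    c = cost_per_relevant(1_000_000, p, 2_000, 1.50)
    print(f"prevalence {p:.0%}: ${c:.2f} per relevant doc found")
```

Even in this crude sketch, the fixed training cost is amortized over more relevant documents as prevalence rises, so per-unit cost drops, which is the general shape of the effect Webber's posts explore in detail.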

Wednesday, September 24, 2014

From Deloitte, A White Paper on Tackling Complex Investigations Using Technology-Assisted, Human-Driven Methods

A very informative white paper from Deloitte, Investigative analytics - Enhancing investigative capabilities for cross-jurisdictional and transnational crimes, presents the issues surrounding the investigation of complex activities like transnational human trafficking, and the challenge of striking the right interplay between human and machine in combating them.

The gist of that interplay:

“Instead of relying on technology alone to break a case, the investigative analytics methodology uses advanced analytical tools to enhance the essential human element: the investigator’s instinct.”

Tuesday, September 23, 2014

The Two Deadly Sins Hobbling Technology-enabled Solutions in eDiscovery

There are some interesting observations about predictive coding’s “failure to launch”.  Here are two recommendations that would mitigate most, if not all, of the impediments to technology-enabled solutions: inordinate cost and unsatisfactory performance.

Cost: Reduce Prices
Most knowledgeable people in the industry are aware that predictive coding vendors adopted a discount-off-the-alternative-manual-review-cost pricing model that assured hefty profits but bore no rational relationship to their own actual costs.  Not only did this in some cases result in no actual savings; more importantly, it occasioned the re-introduction of a process that advocates promised the technology would replace: key word culling. As the recall metrics in the Biomet case establish, even key word culling done well reduces recall in a very blunt way.  Moreover, the current trumpeting of key word culling prior to predictive coding use, in some cases advocated by the very same people who earlier condemned the practice in order to tout predictive coding, has significantly damaged the credibility of predictive coding advocates.  This has not helped the cause of technology-enabled solutions.
Fundamentally changing the vendor pricing model in a way that de-couples document collection size from price would both decrease the price tag for predictive coding use generally and obviate the need for key word culling before predictive coding usage in many cases.
In addition, it would return some of the reduced ediscovery revenue to law firms through the practices that could be adopted under recommendation #2, because vendors would no longer be squeezing as much profit as theoretically possible from the technology solution.
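The incentive at work here can be shown with some toy arithmetic. All of the rates and volumes below are hypothetical, invented only to illustrate why volume-based pricing invites culling while a decoupled price does not:

```python
# Toy arithmetic (hypothetical rates and volumes): per-document pricing
# rewards keyword culling; a collection-size-independent fee does not.

docs_total = 1_000_000
docs_after_culling = 250_000        # an aggressive keyword cull
per_doc_rate = 0.05                 # $/doc ingested into the tool
flat_fee = 15_000                   # price decoupled from volume

per_doc_full = docs_total * per_doc_rate            # bill, no culling
per_doc_culled = docs_after_culling * per_doc_rate  # bill after culling

# Under per-document pricing, culling cuts the vendor bill sharply --
# a strong incentive to sacrifice recall before the tool sees the data.
print(f"savings from culling, per-doc pricing: ${per_doc_full - per_doc_culled:,.0f}")
# Under a decoupled fee, culling saves nothing on the vendor bill.
print(f"savings from culling, flat pricing:    ${flat_fee - flat_fee:,.0f}")
```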

Inertia: Make the Technology and the Processes Better
Invest in developing better technology solutions and develop formal processes to achieve better performance measures. Currently, technology and service vendors, as well as legal and consulting TAR experts, appear to be entirely invested in promoting the notion that predictive coding performance is superior to that of traditional review.  In my experience, and reports of user reluctance support this, users aren’t buying those claims.  Technologies and techniques to reduce the risk of missing important information in a production exist; however, it appears that the industry in the main is dedicated to exacting profit by aggressively making somewhat dubious performance claims and aggressively marketing existing solution platforms, rather than investing in better, more cost-effective solutions that address user concerns.
Recognition of current limitations and an improved approach would have the added benefit of removing reliance on subject matter “experts” and claims of secret skill sets.  In effect, it would rationalize the process by permitting the use of standard methodologies and transparent reporting of the results of process steps.  It would also serve to militate against successful opposing-party objections to technology-enabled discovery, as well as judicial reluctance to approve its use.
Moreover, as mentioned above, the reduction in cost would allow law firms to recoup some of the (admittedly reduced) revenue through the enhanced, more manually driven quality assurance steps in the process.  This, in turn, at the very least removes some of the law firm disincentives to considering technology-centric solutions.

Friday, September 5, 2014

Comparative Performance in Predictive Coding: you miscalculated comparative performance; the bear eats you

There's an old joke about two campers who are awakened in their tent one morning by the sound of an approaching bear.  Upon seeing them, the bear charges.  One camper takes off running.  The other begins putting on his boots.  "Why are you doing that?  You can't outrun a bear!" says the runner.  The other looks up and says, "I don't have to outrun the bear, I only have to outrun you."  To the extent it is funny, it is so because it re-frames the analysis of how to envision a successful outcome.

A different kind of joke about comparative performance has been played on the courts with the promulgation of the claim of superior machine performance in ediscovery.

It goes something like this:  eDiscovery is never perfect. Determining the appropriateness of a novel information retrieval technology can be thorny.  However, in response to court concerns as to whether predictive coding discovery approaches should be accepted for use, "researchers" have serendipitously "established" that the new machine approaches are not only cheaper but, critically for the question of acceptability, better than manual review.

So the argument has gone: courts need only accept that predictive coding "runs better" than manual review to approve its use, and need not hold it to some other, more troublesome performance benchmark (roughly speaking, the bear's speed).

But an objective analysis of the available data indicates that the comparative performance assertion is supported by, well, a trade association of boot sellers, as it were. As I've noted before, the tests relied upon to support these claims are heavily flawed.  At the very least, shouldn't courts look carefully at the performance claims in this kind of situation?

To avoid being eaten -- as it were.

Wednesday, August 27, 2014

eDiscovery "Dark Lakes": Beyond Content-Centric Analytics for Predictive Coding

In examining the Enron corpus, what becomes clear is the sparsity of meaningful terms in email communications.  With boilerplate and noise words removed, more than half of the emails in the corpus contain only a handful of terms.  I have confirmed this with many other document sets.  In addition, in most collections, emails represent a large majority of documents.  Taken together, these observations establish that when predictive coding is employed, there are often ediscovery dark lakes of un-analyzed documents that are quietly overlooked.
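The sparsity observation is easy to check on any corpus. Here is a minimal sketch of the measurement; the stopword list, the three-term threshold, and the sample emails are all hypothetical stand-ins (the Enron corpus itself is not bundled here):

```python
# Sketch: count emails whose content reduces to a handful of
# meaningful terms once noise words are stripped.

STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "in", "for",
             "on", "this", "that", "please", "thanks", "regards"}

def meaningful_terms(text):
    """Distinct non-stopword tokens after crude normalization."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return {t for t in tokens if t and t not in STOPWORDS}

emails = [
    "Please see attached.",
    "Thanks, sounds good to me.",
    "The hedge positions on the California desk exceed the risk limits.",
]

sparse = sum(1 for e in emails if len(meaningful_terms(e)) <= 3)
print(f"{sparse}/{len(emails)} emails have 3 or fewer meaningful terms")
```

A real measurement would of course use a proper tokenizer and boilerplate stripper, but even this crude version makes the "dark lake" population visible in a collection.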

Because the statistically-based pattern detection methods used almost universally in predictive coding perform increasingly erratically as documents become terse (there is very little information to distinguish signal from noise), these approaches are reduced to behaving like Boolean keyword searches: the test comes down to whether a significant "hit word" is or is not present.
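The collapse is easy to see for any linear classifier over term features: when a document has a single meaningful term, its score is just that one term's weight against the bias, i.e. a keyword test. The weights below are made up purely for illustration:

```python
# Sketch of why a linear text classifier degenerates into a keyword
# test on terse documents. Term weights here are hypothetical.

weights = {"fraud": 2.1, "invoice": 0.4, "lunch": -1.3}
bias = -0.5

def score(doc_terms):
    """Linear decision score; positive => predicted responsive."""
    return bias + sum(weights.get(t, 0.0) for t in doc_terms)

# With one meaningful term, the decision reduces to whether that single
# "hit word" carries enough weight to clear the bias -- Boolean search
# in statistical clothing.
print(score(["fraud"]) > 0)   # one-term doc, strong hit word
print(score(["lunch"]) > 0)   # one-term doc, no hit word
```

With longer documents the sum aggregates evidence across many terms; with one or two terms there is no aggregation, which is the erratic regime described above.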

It may well be that current content-based approaches are not effectively analyzing a significant portion of document collections.  Of course, because of the workflows employed, these documents are being ignored in a very discreet manner.  @#$ that happens in a predictive coding engine stays in the predictive coding engine, as it were.

Content-based systems alone won't solve this problem.  Context-based analytics and annotated systems can, though.

Tuesday, August 26, 2014

Looking Beyond Content to Context in Predictive Coding: "Graph-based Text Classification: Learn from Your Neighbors"

The research piece is from the 2006 ACM SIGIR Conference and discusses techniques to enhance "predictive" (really, probabilistic) information classification, e.g. web page classification, by using context information to augment content-based approaches.

The article provides a summary of the rationale for the research:

“[The classifier model] learn[s] parameters of mathematical decision models such as Support Vector Machines, Bayesian classifiers, or decision trees, based on intellectually labeled training data. When the trained classifier is later presented with test data with unknown category labels, the standard paradigm is to apply the decision model to each data item in a ‘context-free’ manner: the decision is based only on the feature vector of a given data item, disregarding the other data items in the test set.

In many settings, this ‘context-free’ approach does not exploit the available information about relationships between data items. For example, if we are asked to classify a book into a genre, we can take advantage of some additional information like the genre of other books by the same author or what other books were bought by the readers of this one.  Similarly, the hyperlink neighbors of a Web page give us clues about the topic of a page.”

Novel approaches like this one have application to ediscovery document classification, especially to all-important communication documents such as emails.  Expect analogous advances in predictive coding, from Topiary Discovery if not others, in the not-too-distant future.
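The neighbor-aware idea the quoted passage describes can be sketched as simple iterative score propagation over a document graph. This is a toy version, not the paper's actual algorithm, and the graph, content scores, and damping factor are hypothetical:

```python
# Sketch of context-aware classification: blend each document's
# content-only score with the scores of its graph neighbors
# (e.g. same email thread, same author). Toy values throughout.

def propagate(content_scores, neighbors, alpha=0.5, iters=10):
    """Iteratively mix content scores with neighbor averages."""
    scores = dict(content_scores)
    for _ in range(iters):
        new = {}
        for node in scores:
            nbrs = neighbors.get(node, [])
            if nbrs:
                nbr_avg = sum(scores[n] for n in nbrs) / len(nbrs)
                new[node] = (1 - alpha) * content_scores[node] + alpha * nbr_avg
            else:
                new[node] = content_scores[node]
        scores = new
    return scores

# Doc 'c' is a terse email that content analysis alone cannot score,
# but both of its thread neighbors look relevant, so propagation
# pulls its score upward.
content = {"a": 0.9, "b": 0.8, "c": 0.0}
graph = {"c": ["a", "b"], "a": ["c"], "b": ["c"]}
final = propagate(content, graph)
print(f"doc c: content-only {content['c']}, with context {final['c']:.3f}")
```

This is exactly the route out of the "dark lakes" problem discussed in the prior post: a terse document that content analysis cannot score can still inherit evidence from its communication context.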