Topiary Discovery LLC

Powered by: Topiary Discovery LLC ... The Home of Predictive Pruning®



Friday, September 5, 2014

Comparative Performance in Predictive Coding: Miscalculate, and the Bear Eats You

There's an old joke about two campers who are awakened in their tent one morning by the sound of an approaching bear.  Upon seeing them, the bear charges.  One camper takes off running.  The other begins putting on his boots.  "Why are you doing that?  You can't outrun a bear!" says the runner.  The other looks up and says, "I don't have to outrun the bear, I only have to outrun you."  To the extent it is funny, it is so because it re-frames the analysis of how to envision a successful outcome.

A different kind of joke about comparative performance has been played on the courts with the promulgation of the claim of superior machine performance in ediscovery.

It goes something like this: eDiscovery is never perfect, and determining the appropriateness of a novel information retrieval technology can be thorny. However, in response to court concerns as to whether predictive coding approaches should be accepted for use, "researchers" have serendipitously "established" that the new machine approaches are not only cheaper but, critically for the question of acceptability, better than manual review.

So the argument has gone: courts need only accept that predictive coding "runs better" than manual review to accept its use; it need not surpass some other, more troublesome performance benchmark (roughly speaking, the bear's speed).

But an objective analysis of the available data indicates that the comparative performance assertion is supported by -- well, the trade association of boot sellers, as it were. As I've noted before, the tests relied upon to support these claims are heavily flawed. At the very least, shouldn't courts look carefully at the performance claims in this kind of situation?

To avoid being eaten -- as it were.

Wednesday, August 27, 2014

eDiscovery "Dark Lakes": Beyond Content-Centric Analytics for Predictive Coding

In examining the Enron corpus, what becomes clear is the sparsity of meaningful terms in email communications. With boilerplate and noise words removed, more than half of the email corpus contains only a handful of terms. I have confirmed this with many other document sets. In addition, in most collections, emails represent a large majority of documents. Taken together, these observations establish that when predictive coding is employed, there are often ediscovery dark lakes of unanalyzed documents that are quietly overlooked.

Because the statistically-based pattern detection methods almost universally used in predictive coding perform increasingly erratically as documents become terse (there is very little information to distinguish signal from noise), these approaches are reduced to behaving like Boolean keyword searches: the test comes down to whether a significant "hit word" is or is not present.
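The reduction to Boolean behavior can be illustrated with a toy sketch. The term weights below are hypothetical, not any vendor's model: a linear text classifier sums learned weights over a document's terms, so once stopwords are stripped from a terse email, its score hinges on a single "hit word".

```python
# Hypothetical learned term weights for a relevance classifier.
WEIGHTS = {"merger": 2.5, "invoice": -0.4, "lunch": -1.8}

# A few illustrative boilerplate/noise words to strip.
STOPWORDS = {"re", "fw", "the", "a", "was", "please", "see", "attached"}

def score(text):
    """Linear classifier: sum the weights of the non-stopword terms."""
    terms = [t for t in text.lower().split() if t not in STOPWORDS]
    return sum(WEIGHTS.get(t, 0.0) for t in terms)

# A longer document aggregates evidence from several terms...
rich = "the merger invoice was discussed over lunch please see attached"
print(score(rich))

# ...but a terse email's decision reduces to a Boolean test for "merger":
terse = "re merger"
print(score(terse))
```

For the one-term email, the classifier's entire decision is the sign of a single weight, which is exactly what a keyword hit test would produce.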

It may well be that current content-based approaches are not effectively analyzing a significant plurality of many document collections. Of course, because of the workflows employed, these documents are being ignored in a very discreet manner. What happens in a predictive coding engine stays in the predictive coding engine, as it were.

Content-based systems alone won't solve this problem.  Context-based analytics and annotated systems can, though.

Tuesday, August 26, 2014

Looking Beyond Content to Context in Predictive Coding: "Graph-based Text Classification: Learn from Your Neighbors"

The research piece is from the 2006 ACM SIGIR Conference and discusses techniques to enhance "predictive" (really, probabilistic) information classification, e.g. web page classification, by using context information to augment content-based approaches.

The article provides a summary of the rationale for the research:

“[The classifier model] learn[s] parameters of mathematical decision models such as Support Vector Machines, Bayesian classifiers, or decision trees, based on intellectually labeled training data. When the trained classifier is later presented with test data with unknown category labels, the standard paradigm is to apply the decision model to each data item in a ‘context-free’ manner: the decision is based only on the feature vector of a given data item, disregarding the other data items in the test set.

In many settings, this ‘context-free’ approach does not exploit the available information about relationships between data items. For example, if we are asked to classify a book into a genre, we can take advantage of some additional information like the genre of other books by the same author or what other books were bought by the readers of this one.  Similarly, the hyperlink neighbors of a Web page give us clues about the topic of a page.”

Novel approaches like this one have application to ediscovery document classification, especially to all-important communication documents such as emails. Expect analogous advances in predictive coding, from Topiary Discovery if not others, in the not-too-distant future.
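The paper's core idea can be sketched in a few lines. The function and weights below are my own simplification, not the authors' algorithm: blend a document's content-only score with the labels of its graph neighbors (for an email, say, other messages in the same thread).

```python
def smoothed_score(content_score, neighbor_labels, alpha=0.7):
    """Blend content evidence with neighborhood evidence.

    content_score: the context-free classifier's output in [0, 1].
    neighbor_labels: known labels (0 or 1) of linked documents.
    alpha: weight on content vs. neighborhood (hypothetical value).
    """
    if not neighbor_labels:
        return content_score  # no context available: fall back to content
    neighborhood = sum(neighbor_labels) / len(neighbor_labels)
    return alpha * content_score + (1 - alpha) * neighborhood

# A terse email scores a weak 0.4 on content alone (context-free call:
# not relevant), but sits in a thread where all 3 neighbors were coded
# relevant; the blended score lands near 0.58, above a 0.5 threshold.
print(smoothed_score(0.4, [1, 1, 1]))
```

The design choice mirrors the quoted passage: the decision is no longer based only on the feature vector of a given item, but also on related items in the collection.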

Thursday, August 21, 2014

Predictive Coding, Document Type and Relevance Issues

A little over 2 years ago, in Sampling with Herb, I mentioned the phenomenon of relevance differences based upon document type:

"3.  In my experience, how documents are responsive to any particular issue correlates somewhat with the document type (emails vs. spreadsheets vs. text documents).  Should stratified sampling based on document type be used first in the seed document set going into the predictive engine to optimize the identification of small clusters of relevant documents of an infrequent document type?  And should stratified sampling be used again to examine the no-hit population?  And if one accepts that in principle this should be done, is there a rational process to determine when and how it would be implemented?"
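The stratified seed idea in point 3 can be sketched simply. Everything below (the corpus shape, the 5% rate, the one-per-stratum floor) is an illustrative assumption: sample each document-type stratum separately so that a rare type, such as spreadsheets, is guaranteed representation in the seed set.

```python
import random
from collections import defaultdict

def stratified_sample(docs, rate, rng=random.Random(42)):
    """docs: list of (doc_id, doc_type) pairs.

    Samples `rate` of each document-type stratum, with a floor of
    one document per stratum so small strata are never skipped.
    """
    strata = defaultdict(list)
    for doc_id, doc_type in docs:
        strata[doc_type].append(doc_id)
    sample = []
    for doc_type, ids in strata.items():
        k = max(1, round(len(ids) * rate))
        sample.extend(rng.sample(ids, k))
    return sample

# Hypothetical collection: 900 emails, 20 spreadsheets.
corpus = [(i, "email") for i in range(900)] + \
         [(i, "spreadsheet") for i in range(900, 920)]
seed = stratified_sample(corpus, rate=0.05)
# A uniform 5% sample could easily miss the 20 spreadsheets entirely;
# the stratified seed always includes at least one.
```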

Reading some current blogs from the ediscovery information retrieval corner, it looks like the topic has surfaced as, I suppose one would say, a "cutting edge" area.

I've built models based upon document type and what I've concluded is that they are fairly cumbersome.  

Let's see what the research community has to say in a couple of years.

Friday, August 8, 2014

Copyright Litigation Analytics Research

"One of the most important ways to measure the impact of copyright law is through empirical examination of actual copyright infringement cases [...] allow[s] us to examine a wide variety of copyright issues, such as the rate of settlements versus judgments; the incidence of litigation between major media companies, small firms, and individuals; the kinds of industries and works involved in litigation; the nature of the alleged infringement; the success rates of particular parties and claims; and the nature of remedies sought and awarded [...] to identify ways in which copyright litigation differs from other civil suits and to show that certain plaintiff characteristics are more predictive of success."

Interesting analytics application research can be found here.

Tuesday, July 29, 2014

Predictive Coding and Predictive Information Governance Classification are Not the Same Thing - Topiary Discovery, originally published 2013

Potential customers are now being bombarded by claims about re-purposing predictive coding tools for information governance.  While there is definitely overlap, the objectives of the two tasks are not equivalent.  The differences require distinctly different analytic capabilities.

For example, many predictive coding tools are suited primarily for binary classification: relevant vs. non-relevant, or privileged vs. non-privileged. Moreover, the workflow is designed for iterative linear model building. Many systems have no capability to comprehensively capture and reuse institutional knowledge about document taxonomies, knowledge that can dramatically aid information governance.

Moreover, a robust classification system must allow individual information items to be classified into more than one category. An email can be subject to a specific legal hold and a financial regulatory retention rule while also being a document containing valuable client feedback. A document may be tagged because it contains personally identifiable information and, separately, because it is both subject to a non-disclosure agreement requirement and considered a legal record.
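The contrast with a binary relevance call can be shown in a small sketch. The category names and rules below are hypothetical, chosen only to mirror the examples above: a governance classifier must return a set of labels per item, not a single yes/no.

```python
# Hypothetical governance rules keyed by category label.
RULES = {
    "legal_hold":      lambda d: d.get("custodian") in {"smith", "jones"},
    "finreg_retain":   lambda d: d.get("type") == "trade_confirmation",
    "client_feedback": lambda d: "feedback" in d.get("subject", "").lower(),
}

def classify(doc):
    """Multi-label classification: return every applicable category."""
    return {label for label, rule in RULES.items() if rule(doc)}

# One email, three simultaneous classifications.
email = {"custodian": "smith",
         "type": "trade_confirmation",
         "subject": "Client feedback on Q2 trade confirms"}
print(classify(email))
```

A binary relevant/non-relevant engine would collapse all of this into one bit; the governance task needs the full label set retained per item.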

Finally, information governance systems must make much greater use of various forms of metadata than current predictive coding solutions do.

Therefore, broadly promoting standard predictive coding solutions as information governance tools is a facile claim. Buyers should not be dazzled by claims of superior analytic technologies but rather should consider carefully the capabilities of predictive coding solutions against their real information governance needs. Consider testing proposed solutions on a few discrete, disparate projects to see whether the solution has value, and whether it captures institutional insights in a manner that lends itself to broader enterprise usage.

(c) 2014  Originally published 12/31/13.