Topiary Discovery LLC




Thursday, January 29, 2015

Benford's Law and Email Patterns: an Observation

Benford’s Law is a well-known phenomenon to those involved in fraud investigations. It is named after physicist Frank Benford, one of the early reporters of the distribution. Benford's Law, also called the First-Digit Law, describes the frequency distribution of leading digits in many (but not all) real-life sources of numerical data. For those interested, Radiolab did a nice piece on Benford’s Law that can be viewed here. The distribution of first digits in many naturally occurring number sets, as predicted by Benford, is as follows:
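For concreteness, the expected Benford frequencies follow directly from the standard formula P(d) = log10(1 + 1/d). A minimal Python sketch (this is just the formula, not data from our study):

```python
import math

# Benford's Law: the expected frequency of leading digit d is log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"digit {d}: {p:.1%}")
```

The digit 1 leads about 30.1% of the time, while 9 leads only about 4.6% of the time.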
It might be of value to determine whether email communication patterns are amenable to Benford's Law analysis. Indeed, Topiary was asked to consider whether email distributions conformed in any way to Benford’s Law.

Here's a short synopsis. We first looked at approximately 165,000 publicly available Enron emails from 56 custodians. Some descriptive analytics are as follows:
Interestingly, at least with this email set, the first-digit analysis of the daily totals for all custodians is somewhat consistent with Benford's Law, as shown here:
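The first-digit tally behind that chart can be sketched as follows. The daily totals below are invented for illustration, not the actual Enron figures:

```python
import math
from collections import Counter

# Hypothetical daily email totals (one number per day, all custodians combined).
# These values are invented for illustration, not the actual Enron figures.
daily_totals = [112, 97, 1403, 230, 18, 305, 2210, 164, 41, 1250]

# Tally the leading digit of each daily total.
first_digits = Counter(str(n)[0] for n in daily_totals if n > 0)
total = sum(first_digits.values())

for d in "123456789":
    observed = first_digits[d] / total
    expected = math.log10(1 + 1 / int(d))
    print(f"digit {d}: observed {observed:.1%}, Benford {expected:.1%}")
```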

But what about each custodian? Do their first-digit totals mirror the aggregate, or is there a distribution curve? The answer is displayed here:

So not all custodian email totals adhere to Benford's Law. It would be of interest to understand why certain custodians fall far out on either end of the distribution curve. It may be simple random variation, or it may be indicative of some underlying characteristic.
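One common way to quantify how far a given custodian departs from Benford's Law is a chi-square goodness-of-fit test. A minimal sketch with invented counts (an illustration of the technique, not the method or data used in the study above):

```python
import math

def benford_chi_square(first_digit_counts):
    """Chi-square goodness-of-fit statistic of observed first-digit counts
    (index 0 -> digit 1, ..., index 8 -> digit 9) against Benford's Law."""
    n = sum(first_digit_counts)
    stat = 0.0
    for d, observed in enumerate(first_digit_counts, start=1):
        expected = n * math.log10(1 + 1 / d)
        stat += (observed - expected) ** 2 / expected
    return stat

# Invented first-digit counts for one custodian's daily totals (digits 1..9).
counts = [30, 18, 12, 10, 8, 7, 6, 5, 4]
stat = benford_chi_square(counts)

# The 5% critical value for chi-square with 8 degrees of freedom is about 15.51.
print(f"chi-square = {stat:.2f}; consistent with Benford: {stat < 15.51}")
```

A custodian whose statistic far exceeds the critical value is the kind of outlier worth a closer look.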

That's a topic for another day...

Saturday, October 18, 2014

William Webber Posts: A Bean Counter's View of Training Approaches in Predictive Coding

In two posts, one and two, text analytics expert and researcher William Webber discusses a model, and a revised model, to estimate the comparative costs of four different training methods employed in analytics-enabled ediscovery.

Worth the read. The second post displays some spreadsheet views of per-unit review costs as prevalence varies. More fundamentally, it promotes the notion that how you conduct your pre-culling (if that is done) and review may affect the type of tool you wish to "rent" from a vendor, and that altering the predictive coding workflow can have significant cost implications.

Wednesday, September 24, 2014

From Deloitte, A White Paper on Tackling Complex Investigations Using Technology-Assisted, Human-Driven Methods

A very informative white paper from Deloitte, Investigative analytics - Enhancing investigative capabilities for cross-jurisdictional and transnational crimes, presents the issues related to the investigation of complex activities like transnational human trafficking, and the challenge of providing the right interplay between human and machine to aid in combating them.

The gist of that interplay:

“Instead of relying on technology alone to break a case, the investigative analytics methodology uses advanced analytical tools to enhance the essential human element: the investigator’s instinct.”

Tuesday, September 23, 2014

The Two Deadly Sins Hobbling Technology-enabled Solutions in eDiscovery

There are some interesting observations circulating about predictive coding’s “failure to launch.” Here are two recommendations that would mitigate most, if not all, of the two chief impediments to technology-enabled solutions: inordinate cost and unsatisfactory performance.

Cost: Reduce Prices
Most knowledgeable people in the industry are aware that predictive coding vendors adopted a discount-off-the-alternative-manual-review-cost pricing model that assured hefty profits but bore no rational relationship to their own actual costs. Not only did this in some cases result in no actual savings; more importantly, it occasioned the re-introduction of a process that advocates promised the technology would replace: keyword culling. As the recall metrics in the Biomet case establish, even keyword culling done well reduces recall in a very blunt way. Moreover, the current trumpeting of keyword culling prior to predictive coding use, in some cases by the very same people who early on condemned its use in order to tout predictive coding, has significantly damaged the credibility of predictive coding advocates. This has not helped the cause of technology-enabled solutions.
Fundamentally changing the vendor pricing model in a way that de-couples document collection size from price would both decrease the price tag for predictive coding use generally and would obviate the need for key word culling before predictive coding usage in many cases. 
In addition, it would return some of the reduced ediscovery revenue to law firms through the practices that could be adopted under recommendation #2, because vendors would not be squeezing as much profit as theoretically possible from the technology solution.

Inertia: Make the Technology and the Processes Better
Invest in developing better technology solutions and develop formal processes to achieve better performance measures. Currently, technology and service vendors, as well as legal and consulting TAR experts, appear to be entirely invested in promoting the notion that predictive coding performance is superior to that of traditional review. In my experience, and reports of user reluctance support this, users aren’t buying those claims. Technologies and techniques to reduce the risk of missing important information in a production exist; however, it appears that the industry in the main is dedicated to exacting profit by aggressively making somewhat dubious performance claims and aggressively marketing existing solution platforms, rather than investing in better, more cost-effective solutions that address user concerns.
Recognition of current limitations and an improved approach would have the added benefit of removing reliance on subject matter “experts” and claims of secret skill sets.  In effect, it would rationalize the process by permitting the use of standard methodologies and transparent reporting of results of process steps.  It would also serve to militate against successful opposing party objections to technology enabled discovery as well as judicial reluctance to approving its use.
Moreover, as mentioned above, the reduction in cost would allow law firms to recoup some of the (admittedly reduced) revenue through the enhanced, more manually driven quality assurance steps in the process. This, in turn, at the very least removes some of the law firm disincentives to considering technology-centric solutions.

Friday, September 5, 2014

Comparative Performance in Predictive Coding: you miscalculated comparative performance; the bear eats you

There's an old joke about two campers who are awakened in their tent one morning by the sound of an approaching bear.  Upon seeing them, the bear charges.  One camper takes off running.  The other begins putting on his boots.  "Why are you doing that?  You can't outrun a bear!" says the runner.  The other looks up and says, "I don't have to outrun the bear, I only have to outrun you."  To the extent it is funny, it is so because it re-frames the analysis of how to envision a successful outcome.

A different kind of joke about comparative performance has been played on the courts with the promulgation of the claim of superior machine performance in ediscovery.

It goes something like this: eDiscovery is never perfect. Determining the appropriateness of a novel information retrieval technology can be thorny. However, in response to court concerns as to whether predictive coding discovery approaches should be accepted for use, "researchers" have serendipitously "established" that the new machine approaches are not only cheaper but, critical to the question of acceptability, better than manual review.

So the argument has gone: courts only need to accept that predictive coding "runs better" than manual review to approve its use; it doesn't need to surpass some other, more troublesome performance benchmark (roughly speaking, the bear's speed).

But an objective analysis of the available data indicates that the comparative performance assertion is supported by, well, the trade association of boot sellers, as it were. As I've noted before, the tests relied upon to support these claims are heavily flawed. At the very least, shouldn't the performance claims be looked at carefully by courts in this kind of situation?

To avoid being eaten -- as it were.

Wednesday, August 27, 2014

eDiscovery "Dark Lakes": Beyond Content-Centric Analytics for Predictive Coding

In examining the Enron corpus, what becomes clear is the sparsity of meaningful terms in email communications. With boilerplate and noise words removed, more than half of the email corpus has only a handful of terms. I have confirmed this with many other document sets. In addition, in most collections, emails represent a large majority of documents. These observations, taken together, establish that when predictive coding is employed, there are often ediscovery "dark lakes" of un-analyzed documents that are quietly overlooked.

Because the statistically based pattern detection methods almost universally used in predictive coding perform increasingly erratically as documents become terse (there is very little information to distinguish noise from signal), these approaches are reduced to behaving like Boolean keyword searches, i.e., the test comes down to whether a significant "hit word" is or is not present.
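To illustrate the collapse to keyword matching, consider a simple set-overlap similarity on a terse email. The terms and helper below are hypothetical, a sketch of the general point rather than any particular engine's scoring:

```python
# A terse email may retain only one or two meaningful terms after
# boilerplate and noise-word removal. With so few features, a set-overlap
# similarity score is driven entirely by whether the "hit word" is present,
# which is exactly what a Boolean keyword search tests.
# All terms below are hypothetical.

def jaccard(a, b):
    """Jaccard similarity between two term lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

terse_email = ["merger"]                      # one meaningful term survives
relevant = ["merger", "disclosure", "sec"]
off_topic = ["lunch", "friday"]

print(jaccard(terse_email, relevant))   # nonzero only because "merger" is present
print(jaccard(terse_email, off_topic))  # zero: no hit word, nothing else to weigh
```

With one surviving term, the score is effectively binary: the hit word is present or it is not.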

It may well be that current content-based approaches are not effectively analyzing a significant plurality of the documents in a collection. Of course, because of the workflows employed, these documents are being ignored in a very discreet manner. @#$ that happens in a predictive coding engine stays in the predictive coding engine, as it were.

Content-based systems alone won't solve this problem.  Context-based analytics and annotated systems can, though.