Topiary Discovery LLC

Powered by: Topiary Discovery LLC ... The Home of Predictive Pruning®

Friday, April 18, 2014

Lower Cost, Higher Recall/Precision Pre-culls: a Place for a Spectrum of Analytics

A proposed workflow for very large data sets:

  • Sample the corpus in situ, or as close to the source as possible. *Do not pay for processing and loading of the entire set.*
  • Use a robust near-duplicate clustering algorithm to identify and code large tracts of near-duplicate documents. Use that information to formulate a search string table, and use the table to code very large document tracts with great accuracy.
  • Use known documents and similar seed material to formulate searches of the sample set.
  • Analyzing all coded sample documents, accumulate meaningful relevant and nonrelevant terms and phrases.
  • Test these terms and obtain recall and precision values (search strings derived from the large near-duplicate sets will almost always be highly reliable, whether the sets are coded relevant or nonrelevant).
  • Perform cluster analytics on email metadata (generally the largest component of collections).
  • Construct content and metadata models.
  • Eliminate precision loss by integrating the metadata model with the search strings.
  • Apply the model to the entire email collection; apply the search terms to the other documents.
  • Carry the “hit” set to the next step.
  • Export only native "hits" and family members to the hosting vendor for processing, loading and hosting.
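The near-duplicate clustering step above can be sketched with a simple term-frequency/cosine-similarity approach. This is a hypothetical minimal illustration (document IDs, the 0.85 threshold, and the greedy exemplar strategy are all assumptions); production tools use far more scalable techniques such as shingling or minhashing:

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster_near_duplicates(docs, threshold=0.85):
    """Greedy single-pass clustering: each document joins the first
    cluster whose exemplar is within the similarity threshold;
    otherwise it starts a new cluster of its own."""
    vectors = {doc_id: term_vector(text) for doc_id, text in docs.items()}
    clusters = []  # list of (exemplar_id, [member_ids])
    for doc_id, vec in vectors.items():
        for exemplar_id, members in clusters:
            if cosine_similarity(vec, vectors[exemplar_id]) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((doc_id, [doc_id]))
    return clusters
```

Coding the exemplar of each large cluster then propagates to every member, which is what makes search-string tables for those tracts so reliable.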

Sunday, March 30, 2014

DOJ Illegally and Secretly Obtaining Data – Oh My!! … Well Not Really

A recent Lexology article trumpets “District court in Washington D.C. exposes illegal e-discovery by Department of Justice”.  After reading the article I have to confess that I was shocked … not by the government conduct alleged, but by the sensational nature of the title.  In fact, the DOJ acted in no illegal manner.

According to the Court opinion upon which the article is based, the Court took issue (and, by its own language, this opinion is at variance with some other judicial views) with the prevailing language of government digital warrant requests presented to it for approval, finding it overly broad. There was no assertion by the Court that the DOJ "has been routinely using over-broad search warrants," as the article states.

That claim - if true - would imply truly unconstitutional and illegal conduct by federal prosecutors. But the Court never claimed anything even remotely like that. Rather, the Court complained in part that government prosecutors have been routinely submitting overly broad warrant REQUESTS to the court to obtain emails and other content stored by third-party providers: meaning that the government has been asking for unfiltered collections of data related to federal criminal investigations; here, crimes apparently related to price-fixing. The Court stated that it had been warning prosecutors to rein in the scope of their requests for data to minimize the impact on privacy - apparently to no avail. The Court then stated how it would treat overly broad requests: by denying them. It was a pointed admonition to the government lawyers that if they want information for their investigations, they had best pay attention. (By the way, for those unfamiliar with these types of investigations, prosecutors and some regulatory agencies do sometimes issue subpoenas for certain types of data that may go directly to third-party data aggregators without prior judicial review, or that require a lower threshold for approval than the Fourth Amendment or § 2703. This case, however, involves applications for warrants that must satisfy a court that there is probable cause for their issuance before the warrants are executed, i.e. served on the provider.)

No allegation of illegal conduct, simply an admonition to shape up or impede your own cases. 

The article also implies something even more ominous: “This [the electronic search and seizure] is done secretly without the knowledge of the person whose email is seized.” It was odd to see this “revelation” of the “secret” nature of the warrant execution in the lead, because the Court took no issue with it. (It should be noted that in the virtual crime-fighting world, execution consists of service of the warrant upon the provider; it involves no breaking down of doors, black bags, or lock-pick sets.) Secrecy often comes part-and-parcel with warrants because of the need for secrecy in criminal investigations. The rationale: disclosing the existence of an investigation to suspected criminals would be Clouseauian.

So What is the Harm Here?

The term "illegal" in law generally imputes a violation of some sort.  Actually obtaining data in violation of the law (or the Constitution) would be illegal.  Using over-broad language in warrant requests to a court after being admonished not to may be bad lawyering (perhaps) and may even theoretically lead to some form of judicial sanction.  But to call the mere act of making the request "illegal" . . .


The damage a headline like this risks is the diminution of respect for professionals who shoulder a tremendously important burden in a global business culture that seems at risk of increasingly criminogenic behavior.  Federal prosecutors have a critical responsibility to investigate and prosecute serious white collar crimes.  They face a burden of proof - “beyond a reasonable doubt” - that is somewhat alien to those who practice in the civil realm.  In a case like this one, involving crimes such as price-fixing, they face well-heeled corporate defendants and their equally well-heeled lawyers.  So prosecutors set out to find as much evidence as possible to satisfy that burden; they want to go into court “loaded for bear,” as it were.  The warrant requirement exists so that courts can keep prosecutors operating in a way that comports with the requirements of the law and the Constitution, as the Court in this case has done.  The opinion highlights a system in which two separate branches of government operate as intended.

To impliedly equate the efforts of DOJ prosecutors, as the article does, with something akin to the data-gathering activities of the NSA that have come to light is unfair and tends to make their job that much harder.

Sunday, March 2, 2014

Successfully Selling High ROI Analytics Projects: the Intersection of Information Security, eDiscovery and Information Governance

Predictive coding is a promising market because discovery always gets budget, but it is hindered in part by defensibility concerns.  Information governance does not carry significant defensibility liability but does suffer from lack of budget.  Information governance sales also suffer from the mistaken view that the undertaking must be done at an enterprise level.  Much has been made of the logic of transitioning predictive coding from ediscovery alone to information governance.  Much of the discourse has been carried out without much thought as to how companies do (or should do) analytics adoption.  This is a shame, because there really is a persuasive argument for sophisticated content-based analytical solutions that provide better return on expenditure for ediscovery projects while enabling a corporation to begin the path toward information governance in a conservative, gradual manner.

Predictive coding, as it stands now, and information governance are distinct beasts.   Predictive coding is episodic; information governance, abiding.  Predictive coding is reactive; information governance, proactive.  Yet the distinction presents the opportunity for synergistic complement.  It is not the commonly asserted simple re-positioning of predictive coding for general information governance use.  Rather, it contemplates the use of discrete targeted security and ediscovery information governance applications, integrating information governance modules specifically and exclusively with ediscovery and information security solutions.  This enables the company to explore analytics value by leveraging available budget to solve multiple bottom-line challenges.

The solution space at the intersection of these two undertakings has some precedent in information security solution proposals.  Near real-time protection against exfiltration of valuable intellectual property in the form of unstructured text is one example of an already utilized targeted information governance solution.  A proposed architecture for such a protection scheme has been detailed in the information security research community using near-duplicate analysis (cosine similarity).  The information security solution entails the use of a text indexing server, a Squid security server and a “content-comparer” server.  The solution provides the ability to block the outbound transmission of items with an index “signature” that is highly similar to an item in the library of signatures of high-value documents.
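A toy version of that comparison step might look like the following. This is a hypothetical sketch: the `signature` function, the 0.9 threshold, and the in-memory library are stand-ins for whatever the indexing server and Squid layer actually provide.

```python
import math
import re
from collections import Counter

def signature(text):
    """Term-frequency 'signature' of a document."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency signatures."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def should_block(outbound_text, signature_library, threshold=0.9):
    """Return True when the outbound item is highly similar to any
    signature in the library of high-value documents."""
    sig = signature(outbound_text)
    return any(cosine_similarity(sig, s) >= threshold
               for s in signature_library)
```

In a deployment, `should_block` would sit behind the proxy's content-adaptation hook, consulting precomputed signatures rather than recomputing them per request.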

These features suggest an architecture for integrated information governance/ediscovery predictive analytics solutions, one that provides additional information security value.  The information security solution would benefit from a more robust document comparison methodology, while the ediscovery information governance solution would have a much richer set of input data from which to predict relevance.

In addition, this type of combined security/information governance solution could be leveraged to extend to ediscovery challenges such as the automated identification and management of the disclosure of corporate documents subject to non-disclosure and confidentiality agreements as well as documents covered by privilege or work product protection.

If you would like more information on this type of approach, contact me at

Wednesday, February 12, 2014

From 2010: Disruptive Use of Predictive Analytics in Compliance and Litigation ... The Application is No Longer New News

Hard to believe it’s been four years since I first posted about the use of predictive analytics for litigation risk assessment in 2010, citing a case study done by an insurance industry colleague and friend, James Ruotolo, who works at analytics juggernaut SAS. 

Here’s the post:

Disruptive Use of Analytics for Enterprises: True Early Case Analysis for Evidence-Based Risk Assessments

   Maybe I should use another term now that ECA (Early Case Assessment) has been pirated by the e-discovery industry as a misnomer for e-discovery culling processes.  Maybe Early Risk Analysis.

   Anyway, in an interesting blog post, James Ruotolo, an insurance fraud maven at SAS, lays out a persuasive argument as to the upside of enterprise data collection and access in evidence-based decision making related to insurance claims investigations and assessments.

   How completely earth-shattering would this approach be to general corporate litigation assessments?  Many civil litigators, used to conducting high-level fact-gathering followed by actual evidence gathering long after decisions have been made, would be apoplectic. But in some matters at least, corporate clients would be able to more accurately assess whether the extant evidence -- structured, transactional, free-text and emails -- supports the decision-makers' version of reality (and anybody who has done this type of work knows that the screens in place in organizations often do not provide the C-level an unvarnished lens into the sausage factory below).  As in claims payments, this can reduce litigation costs and prioritize resources.

   The tools are increasingly in place....

Applying advanced analytics to gauge GRC is not at all new.  For example, in 2007 the accounting profession's association, the AICPA, published a white paper on predictive modeling of claims.
While speaking at an annual IBM SPSS Directions conference several years ago, I was intrigued by analytics sessions that applied text analytics to risk-score fraud in insurance claims as well as in possible FCPA violations. In 2011, in one of my posts, FCPA, UK 2010 Bribery Act, Risk Management, E-discovery, Predictive Analytics and Harmony, I concluded with the question “So as a start what about harmonized compliance/e-discovery analytical platforms for companies that are considered target rich by investigators?”  And in April of last year, the Predictive Analytics World Conference hosted a session entitled “To Sue or Not to Sue: Predicting Litigation Risk.”

However, as with all possible big data analytics approaches, it is overly facile to proffer predictive coding technologies as an adequate compliance risk solution in all but the simplest cases, such as scoring for harassment language.
I have worked with analytics experts on concept pieces for fairly complex analytics platforms that gauge risk in GRC-related areas, for example predictive behavioral analytics to score the risk of insider malfeasance (insider threats).  See Methods and Metrics for Evaluating Analytic Insider

This requires analytics much more complex and diverse than predictive text coding and more complicated workflow methodologies.

For example, even for exclusively litigation purposes, a reasoned analysis of predictive analytics yields some insights.  First, the cases with the largest liability exposure (litigators sometimes refer to a subspecies of these cases as “bet the company”) happen too infrequently to model with any accuracy.  At the other extreme, there is a large category of cases where human judgment alone suffices (here think of seasoned insurance adjusters and plain-vanilla claims, or repetitive litigation suits with an established bell curve of settlement points where humans can accurately “window” settlements).

This leaves a worthwhile mid-section of cases where advocating a pilot test of predictive analytics might prove a reasonable ROI.  (And analytics companies have been working toward this for at least four years.) Even within this band, there are complicating nuances. Throughout the paper battles of civil litigation, fortunes change based upon changing evidence, changing circumstances, and so on.  For a modeling system to be practically useful, there must be a way to identify those data points, acquire them, and input the data into the model over time.  This is not nearly as simple as coding a set of documents.

In conclusion, predictive analytics in compliance and litigation risk analysis is not at all new; it can be applied in discrete situations; and it requires more than simply “predictive coding” expertise for many real world applications.

Thursday, February 6, 2014

Effective Methods to Work Through Client Concerns about Predictive Coding Use in eDiscovery

Some thoughts for increasing the successful usage of predictive coding methodologies:

Emphasize cost savings and qualify (or drop) the comparative performance claims - Quantify the relative costs of standard human review over a spectrum of corpus sizes. Acknowledge the legitimacy of attorney beliefs that in the general case, when measured appropriately, humans accomplish better discovery.
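Quantifying those relative costs can be as simple as a two-line model. The rates below are made-up placeholders (a flat per-document human review rate, a fixed platform fee plus a per-document analytics rate), purely to illustrate how the comparison scales with corpus size:

```python
def review_costs(doc_count,
                 human_rate_per_doc=1.00,   # hypothetical all-in review cost per doc
                 pc_fixed_fee=25_000.00,    # hypothetical platform/setup fee
                 pc_rate_per_doc=0.05):     # hypothetical per-doc analytics cost
    """Illustrative cost comparison between linear human review and a
    predictive coding workflow. All rates are assumptions."""
    human = doc_count * human_rate_per_doc
    predictive = pc_fixed_fee + doc_count * pc_rate_per_doc
    return human, predictive

# The gap widens dramatically as the corpus grows.
for n in (50_000, 500_000, 5_000_000):
    human, predictive = review_costs(n)
    print(f"{n:>9,} docs: human ${human:,.0f} vs predictive ${predictive:,.0f}")
```

Under these placeholder rates the fixed fee dominates for small matters, which is exactly why the cost argument is strongest on very large corpora.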

Assure that within  the framework of proportionality the trade-off between reduced cost and reduced performance can be defensible.

Assuage concerns over the risk presented by diminished ediscovery performance by drilling deeper into performance assessments than recall and precision measures.  Turn the conversation from document-count estimation to assessments of the materiality of the information produced in discovery.
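For reference, the recall and precision measures being drilled past are simple set ratios over a coded validation sample (a minimal sketch; the materiality-based assessments argued for above require richer inputs than these counts):

```python
def recall_precision(retrieved_ids, relevant_ids):
    """Recall = share of truly relevant documents that were retrieved;
    precision = share of retrieved documents that are truly relevant."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    true_positives = len(retrieved & relevant)
    recall = true_positives / len(relevant) if relevant else 0.0
    precision = true_positives / len(retrieved) if retrieved else 0.0
    return recall, precision
```

The weakness these metrics share is that every document counts equally: a production can post a high recall number while still missing the handful of documents that actually matter.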

Recommend systematic as well as ad hoc risk-reduction methodologies that demonstrate defensibility within the proportionality framework by either identifying additional material information or establishing a reduced probability that material relevant information was not produced.  (This will likely require improved processes.)

Finally, offer (for free) to structure an internal test using previously coded data: perform the predictive coding protocol using the information-based processes and obtain a more robust assessment of the quality of identified information versus non-identified material information.

Saturday, January 25, 2014

Proportionality’s Other Shoe: Predictive Coding Will Become a Cheaper “Good Enough” “Point-and-Shoot” Industry

There is a very likely reality that may not have sunk in yet as the predictive coding industry assails the buying public with ever more esoteric claims about predictive coding “engines”: customers may not pick them.  This may be the case because, when assessed across many litigations, predictive coding systems, no matter how fancy the content-based analytics, will hit a performance ceiling band on recall (let’s say 60-80%).  And as customers begin talking with each other about how their selected vendor solution performs, the superiority of the Audi model over the Ford Focus may not wow the audience given the price differential.  For some illustration, in The Good Enough Revolution: When Cheap and Simple Is Just Fine, the real-world case study of point-and-shoot vs. higher-end cameras is discussed (most people wanted the point-and-shoot).

Moreover, and this is significant, users may start to realize that the proportionality paradigm not only supports their argument to use predictive coding in the first instance but also empowers them to argue for the less stellar but markedly less expensive “point-and-shoot” predictive solution: the one that employs older, mundane, but reliable and much less expensive off-the-shelf components, and that leaves budget to wrap them in best-practice workflows.

Finally, the all-important law firms may naturally gravitate toward this philosophy.  Spending less on machinery and augmenting legal quality-control workflow means they can recoup a bit of the review revenue that vendors took from them with the introduction of predictive coding.