Topiary Discovery LLC

Powered by: Topiary Discovery LLC ... The Home of Predictive Pruning®



Sunday, July 26, 2015

Predictive Coding Has Another Practical Application: Detecting Subtle PII Using Regular Expressions (Regex) and "Issue" Scores

     The presence in collections of "Personally Identifiable Information" (PII), which implicates privacy laws and data breach compliance requirements, is a recurring challenge in ediscovery tasks.  One tool used by practitioners to identify common PII is the regular expression.  In essence, regular expressions are pattern screens that enable the detection of specified character strings, whether they be in the contents of an Excel cell or a document.  Many litigation support-related applications, such as TextPad, UltraEdit, and Adobe, can search files for specified regular expressions.  Many programming languages also provide libraries of commonly used expressions for Social Security numbers, credit card numbers, telephone numbers, email addresses and other similar personal-centric information.  Basic regular expressions even provide for some "fuzziness" in pattern matching, which can help to match, for example, both of the Social Security number forms "111-11-1111" and "111111111".
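As an illustration of that kind of "fuzziness", here is a minimal Python sketch. The pattern and the function name are assumptions made for this example, not a preset expression from any of the tools mentioned; the pattern only checks the shape of the string and ignores the rules about which Social Security numbers are actually valid.

```python
import re

# Hypothetical pattern: nine digits, with an optional hyphen after the
# third and fifth digits. The lookarounds keep it from matching inside a
# longer run of digits. It does not check SSN validity rules.
SSN_PATTERN = re.compile(r"(?<!\d)\d{3}-?\d{2}-?\d{4}(?!\d)")

def find_ssn_candidates(text):
    """Return every substring shaped like a Social Security number."""
    return SSN_PATTERN.findall(text)

sample = "Employee SSN 111-11-1111 on file; legacy field shows 111111111."
print(find_ssn_candidates(sample))
```

Making the hyphens optional is what lets one expression catch both the dashed and the undashed form.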

     I have used the regex capabilities along with the linguistic analysis features of IBM SPSS Modeler extensively for a number of years.  Modeler not only enables the user to screen for common PII using preset expressions, it also facilitates more nuanced PII detection.  For example, I have used the linguistic resources to create rules that apply broadly defined regex screens that also require the presence of specified terms or even "term types" (taxonomy tags for collections of topically similar terms) within a specified proximity of the possible PII regex pattern.  This enables a more robust recall of true targeted PII items while minimizing precision loss by excluding noise strings that would otherwise match the broad pattern.  One use case would be to accurately identify the subset of emails and documents that contain international account information where no known expression set would comprehensively identify the account IDs because they are too variable.  Modeler also provides the ability to combine this type of analysis with predictive coding-type scoring, using supervised learning algorithms and training sets, to further refine the screen.  Using these approaches, PII that presents little (or even no) readily identifiable pattern can be more readily and accurately detected.
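The proximity idea can be sketched outside of Modeler in a few lines of Python. This is not Modeler's rule syntax; the broad pattern, the term list, the 40-character window, and the function name are all assumptions chosen for illustration.

```python
import re

# Hypothetical broad screen: any 8-34 character run of capitals and digits.
ACCOUNT_LIKE = re.compile(r"\b[A-Z0-9]{8,34}\b")
# Hypothetical "term type": words that suggest account information.
CONTEXT_TERMS = re.compile(r"\b(?:account|acct|iban|swift|routing)\b", re.IGNORECASE)

def find_pii_with_context(text, window=40):
    """Keep broad-pattern hits only when a context term appears nearby."""
    hits = []
    for m in ACCOUNT_LIKE.finditer(text):
        if not any(ch.isdigit() for ch in m.group()):
            continue  # skip all-letter tokens such as capitalized words
        before = text[max(0, m.start() - window):m.start()]
        after = text[m.end():m.end() + window]
        if CONTEXT_TERMS.search(before) or CONTEXT_TERMS.search(after):
            hits.append(m.group())
    return hits

print(find_pii_with_context("Please wire to account DE44500105175407324931 today."))
print(find_pii_with_context("Order reference DE44500105175407324931 shipped."))
```

The same string is kept in the first sentence and dropped in the second, which is the point of the rule: the broad screen supplies recall and the nearby terms restore precision.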

     This brings me to the notion of using predictive coding tools commonly available through commercial ediscovery vendors in combination with regex screens.  In cases where comprehensive, accurate PII identification is a critical task, creating an algorithm that combines broadly defined regex screens with "PII likelihood" scores produced using a manually trained predictive coding model can provide significantly enhanced PII recall and precision.  In developing this type of approach, practitioners should look to create and augment persistent libraries of expressions, rules, and "PII issue code" term sets and models for reuse wherever practical.
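One simple way such a combination could work is sketched below in Python. Everything here is an assumption for illustration: the broad screen, the two thresholds, and the function name are invented, and the "PII likelihood" score is assumed to be a number between 0 and 1 exported from whatever predictive coding tool is in use.

```python
import re

# Hypothetical broad screen: a digit, then at least seven digits, hyphens
# or spaces, then a digit (phone numbers, SSNs, account numbers, and noise).
BROAD_SCREEN = re.compile(r"\d[\d\- ]{7,}\d")

def flag_for_pii_review(text, pii_score, with_hit=0.3, without_hit=0.8):
    """Combine a regex hit with a model score.

    A document that trips the broad screen needs only a modest score to be
    flagged; a document with no regex hit needs a high score on its own.
    """
    has_hit = BROAD_SCREEN.search(text) is not None
    threshold = with_hit if has_hit else without_hit
    return pii_score >= threshold

print(flag_for_pii_review("Call 555-867-5309", 0.4))
print(flag_for_pii_review("Lunch on Friday?", 0.4))
```

In practice the thresholds would be set by sampling and measuring recall and precision on the collection at hand, not fixed in advance.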

     It should be noted that this approach can be used in discovery projects outside the U.S.  For example, it can be implemented to provide evidence of the employment of best practices in complying with the EU Data Protection Directive.

Thursday, January 29, 2015

Benford's Law and Email Patterns: an Observation

Benford’s Law is a well-known phenomenon to those involved in fraud investigations. It is named after physicist Frank Benford, who was one of the early reporters of the distribution. Benford's Law, also called the First-Digit Law, describes the frequency distribution of first digits in many (but not all) real-life sources of data. For those interested, Radiolab did a nice piece on Benford’s Law that can be viewed here. The distribution of first digits in many naturally occurring sets of numbers, as predicted by Benford, is as follows:
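For reference, the expected proportions come from the standard formula P(d) = log10(1 + 1/d) for a first digit d from 1 to 9. A minimal Python sketch that prints them (the function name is just a label for this example):

```python
import math

def benford_expected():
    """Expected share of each first digit (1-9) under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for digit, share in benford_expected().items():
    print(f"{digit}: {share:.1%}")
```

The digit 1 leads roughly 30% of the time and the digit 9 less than 5% of the time.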
It might be of value to determine if email communication patterns are amenable to Benford Law analysis. Indeed, Topiary was requested to consider whether email distributions conformed in any way to Benford’s Law.

 Here's a short synopsis. We first looked at approximately 165,000 publicly available Enron emails from 56 custodians. Some descriptive analytics are as follows:
   Interestingly, at least with this email set, the first digit analysis of the daily totals for all custodians is somewhat consistent with Benford's Law, as shown here: 

 But what about each custodian? Do their first digit totals mirror the aggregate, or is there a distribution curve? The answer is displayed here: 

So not all custodian email totals adhere to Benford's Law. It would be of interest to understand why certain custodians are way out on either end of the normal distribution curve. It may be simple random variation, or it may be indicative of some underlying characteristic.
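For anyone who wants to try this on their own email set, here is a minimal Python sketch of the per-custodian calculation. It is not the workflow used for the Enron analysis above: the (custodian, date) record layout, the function names, and the use of mean absolute deviation as the distance from Benford are all assumptions for illustration.

```python
import math
from collections import Counter

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def daily_totals_by_custodian(records):
    """records: iterable of (custodian, date) pairs, one per email."""
    per_day = Counter(records)  # (custodian, date) -> emails that day
    totals = {}
    for (custodian, _day), count in per_day.items():
        totals.setdefault(custodian, []).append(count)
    return totals

def first_digit_shares(daily_totals):
    """Share of daily totals that start with each digit 1-9."""
    digits = [int(str(n)[0]) for n in daily_totals if n > 0]
    if not digits:
        raise ValueError("no positive daily totals to analyze")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

def mean_abs_deviation(shares):
    """Average absolute gap between observed shares and Benford's Law."""
    return sum(abs(shares[d] - BENFORD[d]) for d in range(1, 10)) / 9

# Toy data only, to show the calling pattern.
records = [("a", "d1"), ("a", "d1"), ("a", "d2"), ("b", "d1")]
for custodian, totals in daily_totals_by_custodian(records).items():
    print(custodian, round(mean_abs_deviation(first_digit_shares(totals)), 3))
```

Ranking custodians by that deviation gives one way to see who sits at either end of the curve.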

That's a topic for another day...

Saturday, October 18, 2014

William Webber Posts: A Bean Counter's View of Training Approaches in Predictive Coding

In two posts, one and two, text analytics expert and researcher William Webber discusses a model, and a revised model, to estimate the relative costs of four different training methods employed in analytics-enabled ediscovery.

Worth the read.  The second post displays some spreadsheet views of per unit review costs as prevalence varies.  More fundamentally, it promotes the notion that how you conduct your pre-culling (if that is done) and review may impact the type of tool you may wish to "rent" from a vendor, and how altering the predictive coding workflow can have significant cost implications.

Wednesday, September 24, 2014

From Deloitte, A White Paper on Tackling Complex Investigations Using Technology-Assisted, Human-Driven Methods

A very informative whitepaper from Deloitte, Investigative analytics - Enhancing investigative capabilities for cross-jurisdictional and transnational crimes, presents the issues related to the investigation of complex activities like transnational human trafficking, and the challenge of providing the right interplay between human and machine to aid in combating them.  

The gist of that interplay:

“Instead of relying on technology alone to break a case, the investigative analytics methodology uses advanced analytical tools to enhance the essential human element: the investigator’s instinct.”

Tuesday, September 23, 2014

The Two Deadly Sins Hobbling Technology-enabled Solutions in eDiscovery

There are some interesting observations about predictive coding’s “failure to launch”.  Here are two recommendations that mitigate most if not all of the impediments to technology-enabled solutions: inordinate cost and unsatisfactory performance.

Cost: Reduce Prices
Most knowledgeable people in the industry are aware that predictive coding vendors adopted a discount-off-of-the-alternative-manual-review-cost pricing model that assured hefty profits but bore no rational relationship to their own actual costs.  Not only did this in some cases result in no actual savings; more importantly, it has occasioned the re-introduction of a process that advocates promised predictive coding would replace: key word culling. As the recall metrics in the Biomet case establish, even key word culling done well reduces recall in a very blunt way.  Moreover, the current trumpeting of key word culling prior to predictive coding use, in some cases advocated by the very same people who earlier condemned its use in order to tout predictive coding, has significantly damaged the credibility of predictive coding advocates.  This has not helped the cause of technology-enabled solutions.
Fundamentally changing the vendor pricing model in a way that de-couples document collection size from price would both decrease the price tag for predictive coding use generally and would obviate the need for key word culling before predictive coding usage in many cases. 
In addition, it would return some of the reduced ediscovery revenue to law firms through the practices that could be adopted under recommendation #2, because vendors would no longer be squeezing as much profit as theoretically possible from the technology solution.

Inertia: Make the Technology and the Processes Better
Invest in developing better technology solutions and develop formal processes to achieve better performance measures. Currently, technology and service vendors, as well as legal and consulting TAR experts, appear to be entirely invested in promoting the notion that predictive coding performance is superior to that of traditional review.  In my experience, and reports of user reluctance support this, users aren’t buying those claims.  Technologies and techniques to reduce the risk of missing important information in a production exist; however, it appears that the industry in the main is dedicated to exacting profit by making somewhat dubious performance claims and aggressively marketing existing solution platforms, rather than investing in better, more cost-effective solutions that address user concerns.
Recognition of current limitations and an improved approach would have the added benefit of removing reliance on subject matter “experts” and claims of secret skill sets.  In effect, it would rationalize the process by permitting the use of standard methodologies and transparent reporting of the results of process steps.  It would also serve to militate against successful opposing party objections to technology-enabled discovery as well as judicial reluctance to approve its use.
Moreover, as mentioned above, the reduction in cost will allow law firms to recoup some of the (admittedly reduced) revenue in the enhanced, more manually driven quality assurance steps in the process.  This, in turn, at the very least removes some of the law firm disincentives to considering technology-centric solutions.

Friday, September 5, 2014

Comparative Performance in Predictive Coding: you miscalculated comparative performance; the bear eats you

There's an old joke about two campers who are awakened in their tent one morning by the sound of an approaching bear.  Upon seeing them, the bear charges.  One camper takes off running.  The other begins putting on his boots.  "Why are you doing that?  You can't outrun a bear!" says the runner.  The other looks up and says, "I don't have to outrun the bear, I only have to outrun you."  To the extent it is funny, it is so because it re-frames the analysis of how to envision a successful outcome.

A different kind of joke about comparative performance has been played on the courts with the promulgation of the claim of superior machine performance in ediscovery.

It goes something like this:  eDiscovery is never perfect. Determining the appropriateness of a novel information retrieval technology could be thorny.  However, in response to court concerns as to whether predictive coding discovery approaches should be accepted for use, "researchers" have serendipitously "established" that the new machine approaches are not only cheaper but, critical to the question of acceptability, better than manual review.

So the argument has gone: courts need only accept that predictive coding "runs better" than manual review to approve its use; it need not surpass some other, more troublesome performance benchmark (roughly speaking, the bear's speed).

But an objective analysis of the available data indicates that the comparative performance assertion is supported by, well, a trade association of boot sellers, as it were. As I've noted before, the tests relied upon to support these claims are heavily flawed.  At the very least, shouldn't the performance claims be looked at carefully by courts in this kind of situation?

To avoid being eaten -- as it were.