Topiary Discovery LLC

Powered by: Topiary Discovery LLC ... The Home of Predictive Pruning®



Thursday, August 21, 2014

Predictive Coding, Document Type and Relevance Issues

A little over 2 years ago, in Sampling with Herb, I mentioned the phenomenon of relevance differences based upon document type:

"3.  In my experience, how documents are responsive to any particular issue correlates somewhat with the document type (emails vs. spreadsheets vs. text documents).  Should stratified sampling based on document type be used first in the seed document set going into the predictive engine to optimize the identification of small clusters of relevant documents of an infrequent document type?  And should stratified sampling be used again to examine the no-hit population?  And if one accepts that in principle this should be done, is there a rational process to determine when and how it would be implemented?"

Reading some of the current blogs from the ediscovery information retrieval corner, it looks like the topic has surfaced as, I suppose one would say, a "cutting edge" area.

I've built models based upon document type and what I've concluded is that they are fairly cumbersome.  
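To make the stratified-sampling question above concrete, here is a minimal Python sketch of drawing a seed set by document type. It is an illustration under stated assumptions, not the method used in the models mentioned here: the `type` field, the `stratified_sample` function and the toy corpus are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(docs, per_stratum, seed=0):
    """Draw up to `per_stratum` documents at random from each document type."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    by_type = defaultdict(list)
    for doc in docs:
        by_type[doc["type"]].append(doc)
    sample = []
    for doc_type in sorted(by_type):
        group = by_type[doc_type]
        k = min(per_stratum, len(group))  # a rare type may have fewer docs
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical corpus: mostly emails, very few spreadsheets.
corpus = (
    [{"id": f"email-{i}", "type": "email"} for i in range(900)]
    + [{"id": f"text-{i}", "type": "text"} for i in range(90)]
    + [{"id": f"sheet-{i}", "type": "spreadsheet"} for i in range(10)]
)

seed_set = stratified_sample(corpus, per_stratum=5)
counts = defaultdict(int)
for doc in seed_set:
    counts[doc["type"]] += 1
```

A simple random sample of 15 documents from this corpus would often contain no spreadsheets at all; the stratified draw guarantees each type is represented, which is the point of the question.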

Let's see what the research community has to say in a couple of years.

Friday, August 8, 2014

Copyright Litigation Analytics Research

"One of the most important ways to measure the impact of copyright law is through empirical examination of actual copyright infringement cases [...] allow[s] us to examine a wide variety of copyright issues, such as the rate of settlements versus judgments; the incidence of litigation between major media companies, small firms, and individuals; the kinds of industries and works involved in litigation; the nature of the alleged infringement; the success rates of particular parties and claims; and the nature of remedies sought and awarded [...] to identify ways in which copyright litigation differs from other civil suits and to show that certain plaintiff characteristics are more predictive of success."

Interesting analytics application research can be found here.

Tuesday, July 29, 2014

Predictive Coding and Predictive Information Governance Classification are Not the Same Thing - Topiary Discovery, originally published 2013

Potential customers are now being bombarded by claims about re-purposing predictive coding tools for information governance.  While there is definitely overlap, the objectives of the two tasks are not equivalent.  The differences require distinctly different analytic capabilities.

For example, many predictive coding tools are suited primarily for binary classification, relevant vs. non-relevant or privileged vs. non-privileged.  Moreover, the workflow is designed for iterative linear model building.  Many systems have no capabilities to comprehensively capture and reuse institutional knowledge about document taxonomies that can dramatically aid in information governance.

Moreover, a robust classification system must allow for individual information items to be classified into more than one category.  An email can be subject to a specific legal hold and a financial regulatory retention rule, and can also be a document containing valuable client feedback. A document may be tagged because it contains personally identifiable information and separately because it is both subject to a non-disclosure agreement requirement and is considered a legal record.
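The difference between a binary call and multi-category classification can be sketched in a few lines of Python. This is only a toy illustration: the category names, the keyword tests in `RULES` and the `Item` class are hypothetical, and a real system would use trained models and metadata rather than keyword checks.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    item_id: str
    text: str
    labels: set = field(default_factory=set)  # zero, one or many categories

# Hypothetical categories, each with a toy keyword test.
RULES = {
    "legal_hold": lambda t: "acme v. widgetco" in t,
    "pii": lambda t: "ssn" in t,
    "nda": lambda t: "confidential under nda" in t,
}

def classify(item):
    """Attach every category whose test matches, not just a single yes/no."""
    text = item.text.lower()
    item.labels = {name for name, test in RULES.items() if test(text)}
    return item

email = classify(Item(
    "msg-1",
    "Re: Acme v. WidgetCo. Employee SSN attached, confidential under NDA.",
))
memo = classify(Item("memo-7", "Lunch menu for Friday."))
```

Here `email.labels` holds three categories at once and `memo.labels` is empty, which a single relevant/non-relevant flag cannot express.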

Finally, information governance systems must make much greater use of various forms of metadata than current predictive coding solutions analyze.

Therefore, generally promoting standard predictive coding solutions as an information governance tool is a facile claim.  Buyers should not be dazzled by the claims of superior analytic technologies but rather should consider carefully the capabilities of predictive coding solutions against their real information governance needs.  Consider testing proposed solutions on a few discrete disparate projects to see if the solution has value, and whether it captures institutional insights in a manner that lends itself to broader enterprise usage.

(c) 2014  Originally published 12/31/13.

Sunday, July 27, 2014

Digital Documents in High Stakes Investigations: Like Mosquitoes In Amber


N.B.: Now the Citigroup email can be added to the case study example set.

NJ Bridgegate
CitiGroup Email

Where at all possible, the beginning of any serious investigation requires the casting of a wide net for the digital record: the emails; text messages; and other media splotches of communication about the issue at hand.  They usually present the best opportunity to get the “best” truth about a matter at issue by providing a reliable factual framework to test witness statements, something akin to getting the mosquito DNA trapped in ancient amber.

And so it is, for example, in the current New Jersey federal investigation, for which subpoenas have been issued.

What is likely taking form in the wake of these subpoenas is a phalanx of attorneys and support staff representing a number of stakeholders now involved in the various inquiries.  And from an ediscovery standpoint, it is likely a small cottage industry of contract lawyers and ediscovery vendors will emerge like desert rain frogs.

All of this to examine and identify facts trapped in digital records, to try to get to the “truth”.  Those facts are now (if the parties involved are not foolish enough to think otherwise) virtually unalterable.  They will almost invariably include very blunt statements (like the text message that presaged all of it) that leave little wiggle room in their import, immutable and impervious to the ministrations of a lawyer’s counsel.

Where the stakes are high (often they are really not in civil litigation), having the digital record before finalizing interviews with the parties who possess information is a key advantage sometimes overlooked by attorneys not versed in investigative techniques who conduct those interviews.  A witness who perceives that the interviewer knows more than the witness expected can become disconcerted, and this can result in more truthful statements.  Similarly, attorneys armed with sufficient factual detail prior to an interview have the luxury of either confronting and contradicting or framing and boxing in witness statements.

This presents the case study as to why electronic discovery is about more than dry academic concepts like recall or precision. It’s about finding the documents that singularly help ascertain truth.  Some involved parties will have a different view on the desirability of successfully identifying such documents, but those who recognize the critical value of truth-finding in investigative and judicial processes should take careful note when evaluating the ways in which ediscovery is conducted.

(c) 2014   Originally published January 24,  2014

Saturday, July 26, 2014

Note to Counsel: predictive coding protocols will have more impact on the outcomes of your cases than depositions

For those attorneys in complex litigation, who have a real interest in assuring that their client is receiving all that is reasonably due to them in discovery, it’s time to reclaim the discovery process from technologists.   This piece is not a diatribe against technologists; it is a call for attorneys to become more objectively informed so that they can own the “counsel” hat even where litigation involves "big data".

 One example of where experienced trial attorneys need to provide guidance and a “reality check” is in the area of technology-assisted review (hereinafter “predictive coding”).  

Before the majority of grizzled attorneys move on, let me say this:

in many cases knowing what is going on when your opponent proposes to use predictive coding will have more impact on the outcomes of your cases than depositions.

Now, it is safe to say that you and your litigation teams are being bombarded by predictive coding claims on an almost real-time basis.   The two main claims have to do with: 1) cost and time savings; and, 2) performance being as good as or better than human review.

In terms of cost savings, predictive coding advocates promise to cut ediscovery costs.  As vendors price services more reasonably, this will become an increasingly accurate assertion.  And because the machine works quickly compared to humans the time required to finish an ediscovery project is shortened considerably.

The focus of this article is the second claim, specifically the current use of the recall measure in ediscovery and its relationship to the comparative performance claim.

It is these weeds where practicing attorneys really need to spend some time to dig beneath the industry palaver and examine what predictive coding protocols mean in terms of the discovery that they receive.  Acquiescing to a protocol that in actuality facilitates under-production potentially means not receiving key information.

Let’s start the discussion of one aspect of recall as well as the claim of comparative performance by examining an important phenomenon found in ediscovery document collections:  document clusters.

Document Clusters

Every document corpus in ediscovery is a symphony of clustered sets of similar documents (clusters being defined by, e.g., the co-occurrence of the terms used in the documents).  In the most extreme version, this is represented by near-duplicate document clusters, but the clusters can be loosened to include similar non-duplicate documents.

There is a hypothesis in information retrieval called the cluster hypothesis.  It has been validated by experience in ediscovery and within information retrieval research more broadly.  It holds that "documents in the same cluster behave similarly with respect to relevance to information needs" (Introduction to Information Retrieval).

Notice that the hypothesis above describes document behavior.  

Now here’s a fairly well established property (as opposed to a hypothesis) of information retrieval using machine-learning-based predictive models:  "predictive models make similar predictions as a function of the similarity of documents".

Barring the occasion where similar documents contain specific words that distinguish relevant from non-relevant (and for which predictive coding is not best suited), highly similar documents will receive the same predictive code.

And here’s a statement of a hypothesis concerning information retrieval using humans who randomly review documents: "humans will code documents differently in the same cluster".  There is support in the research literature for the assertion that reviewers disagree on coding for the same document (e.g. Roitblat, Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the American Society for Information Science and Technology, 61(1):70–80, 2010; and Voorhees, Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing & Management, 36(5): 697–716, September 2000).

It is not much of a stretch to conclude they would differ to some degree on similar documents. 

To see the impact clusters have on predictive coding measures and performance claims, let’s continue on to a discussion of recall.

Recall in a Clustered World

Recall is the proportion of retrieved relevant documents out of the entire relevant set.  It is a measure developed to facilitate feasible academic information retrieval research decades ago.
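As a minimal sketch of that definition in Python (the document identifiers are made up for illustration):

```python
def recall(retrieved, relevant):
    """Proportion of the relevant set that was actually retrieved."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when nothing is relevant")
    return len(set(retrieved) & relevant) / len(relevant)

# Ten relevant documents exist; the review retrieved seven of them
# plus two non-relevant documents (ids 20 and 21).
relevant_docs = set(range(1, 11))
retrieved_docs = {1, 2, 3, 4, 5, 6, 7, 20, 21}

r = recall(retrieved_docs, relevant_docs)  # 7 / 10
```

Note that the measure only counts documents. Nothing in it looks at what the documents say, which matters for the discussion that follows.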

The use of recall requires some assumptions. One of these, a very unrealistic one, is relevance independence, meaning that the relevant information contained in one document has no relationship with the relevant information in another.  Although never mentioned in industry literature, this inherently makes recall a very suspect measure of predictive coding performance in a world of clustered documents.

An example of the warping effect of clusters can be found in a prior blog post here.

Retrieval Performance in eDiscovery

Attorneys, though Luddites some may be, should not be cowed into blind acceptance by technical jargon.  Predictive coding is a tool for use in litigation.  Attorneys should apply their finely honed sense of skepticism to claims made by experts from other areas.
After all, this is still our house.

So from a litigator’s standpoint, what does ediscovery performance mean, and does recall measure retrieval performance given the objectives of discovery?  The answer to the first question is given next.  Upon critical examination, it can be strongly argued that the answer to the second is no.

Retrieval objectives in ediscovery are markedly different than in information retrieval research exercises.  Performance in electronic discovery -- and this is fundamentally different than the information retrieval research mindset, likely very alien to it -- should be measured in part as a proportion of all novel information that has been retrieved and produced; novel information means information that has not already been identified in the retrieval process.    (In my experience, trial attorneys “get” this immediately; researchers cling to the research paradigm. I attribute it to, as McNulty noted, “different tribes”).

Recall Doesn’t Work as a Retrieval Performance Measure in eDiscovery
From a litigator’s standpoint, the operational objective of electronic discovery is the production of relevant non-privileged information that can be reasonably identified and produced.  It is an information-centric objective; it is not document-centric.  Therefore, a document-centric measure like recall may not be valid.  Indeed, where relevance independence is not present, as in a corpus full of clustered documents, it becomes highly suspect.  Remember, recall doesn’t directly measure information conveyance; it counts documents.  In a clustered document corpus this means that recall grants to every identified relevant document in a cluster an equal aggregative performance “point” even though the value of subsequent documents from within the cluster is nil or minimal.
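The gap between counting documents and counting information can be illustrated with a small Python sketch. It rests on a loud assumption: that each cluster carries one distinct piece of relevant information, so "share of clusters with at least one produced document" can stand in for an information measure. That stand-in (`cluster_coverage`), the cluster sizes and the identifiers are hypothetical and are not a metric proposed in this post.

```python
def document_recall(produced, relevant):
    """Conventional recall: every relevant document earns one point."""
    return len(produced & relevant) / len(relevant)

def cluster_coverage(produced, clusters):
    """Share of relevant clusters with at least one produced document.

    Assumes (hypothetically) one distinct piece of information per cluster.
    """
    hit = sum(1 for members in clusters.values() if members & produced)
    return hit / len(clusters)

# Three clusters of relevant documents of very different sizes.
clusters = {
    "A": {f"a{i}" for i in range(80)},
    "B": {f"b{i}" for i in range(15)},
    "C": {f"c{i}" for i in range(5)},
}
relevant = set().union(*clusters.values())

# A production containing all of cluster A and nothing from B or C.
produced = set(clusters["A"])

doc_recall = document_recall(produced, relevant)  # 80 of 100 documents
coverage = cluster_coverage(produced, clusters)   # 1 of 3 clusters
```

Under these assumptions the production scores 80% recall while conveying only one of the three pieces of information, which is the distortion described above.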

The issue of recall use is one of those places where attorneys need to step up and wear their attorney hat.  That means reminding oneself as an attorney that predictive coding is not an academic exercise and that in an environment where digital information is crucial, decisions about how discoverable information is identified and produced are very important.  No trial attorney spends time wondering whether they received 60% or 70% of the relevant documents; they want to know if they were denied important discoverable information.  Attorneys should argue for a measure that conveys that information.

Recall’s Sleight-of-Hand When Comparing Predictive Coding to Manual Review
In a clustered document environment (and they all are), because machine techniques will mechanically code documents within a similarity cluster entirely consistently, clusters will be entirely coded either correctly or incorrectly.   And unless (in the very unlikely event) the predictive coding model correctly identifies all documents containing relevant information, if recall is used as the measure of how well the system did in information retrieval, the model will receive an artificially inflated performance grade by virtue of the multiplier effect of consistently coding clustered documents.  In effect, predictive coding recall measures can generate the appearance of better performance by producing relatively large proportions of relevant documents that in reality contain a relatively narrow band of information.

For human retrieval processes, the effect of clusters on recall is likely just the opposite, and it is a very unjust blow to human information retrieval abilities.  The disagreements found within reviewer retrieval processes noted in the studies mentioned above, and so often ballyhooed, should in fact be viewed as an elegant (if not cost-effective) redundant information-centric system that trades some precision for the assurance that at least one or some instance(s) of the information contained within a document cluster is correctly coded and included in a production.  When viewed from a litigator’s information-centric perspective, the traditional disorganized internally inconsistent review process provides much more robust discovery than recall measures indicate.  Unfortunately, recall is too blunt a measure.
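The redundancy argument can be put in rough numbers with a short Python sketch. It assumes, purely for illustration, that each document in a cluster is coded relevant independently with the same probability `p_hit`. That independence assumption is a simplification of mine; it is not drawn from the studies cited above, and real reviewer behavior may be correlated within a cluster.

```python
def p_cluster_represented(cluster_size, p_hit):
    """Chance that at least one document in a cluster is coded relevant,
    assuming independent coding decisions with probability p_hit each."""
    return 1.0 - (1.0 - p_hit) ** cluster_size

# Inconsistent human review: each relevant document is caught half the time.
human = {n: p_cluster_represented(n, 0.5) for n in (1, 5, 10)}

# Fully consistent coding behaves like a cluster of size one: the whole
# cluster is either produced or missed, whatever its actual size.
consistent = p_cluster_represented(1, 0.5)
```

With `p_hit` at 0.5, a five-document cluster is represented in the production about 97% of the time under inconsistent coding, while document-level recall for that cluster would average only 50%. The consistent coder at the same per-cluster accuracy gets the cluster right half the time regardless of its size.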

The assertion made here for attorney consideration is that if tested using an information metric, rather than the simplified document count recall measure, human review performance would once again be seen as the gold standard of performance. 
A legal-centric view of ediscovery performance renders the claims about the comparative (or superior) performance of predictive coding approaches unsupportable, and renders reliance upon the studies that assert them ill-advised.  Predictive coding may well cut costs, but it comes at a price of diminished information discovery.  That realization should lead courts to a significantly different view of current protocol proposals.

Finally, although this is never discussed by the community of predictive coding promoters, attorneys with real clients in disputes where a just resolution may well turn on the quality of ediscovery should resist “the illusory truth effect” of industry hyperbole and begin discussing issues like this… with their colleagues as well as the courts.  

Thursday, July 24, 2014

The Evolution of Insight Creation: Beyond Content-based Predictive Coding

The content-based statistical predictive coding performance assessments currently advocated are both squishy and uncertain.  Sampling to determine yield leads to wide margins of error.  At levels below full recall (and how do you ever know you have that?), recall tells you nothing about the information gain in the production, or the relative diversity of produced documents.   Stakeholders are left to argue proportionality based on very fuzzy, incomplete information.
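To show how wide those margins can be, here is a minimal Python sketch using the textbook normal-approximation interval for a sampled proportion. The sample size, the count of relevant documents and the corpus size are hypothetical, and the normal approximation is itself rough at low prevalence, so treat the numbers as indicative only.

```python
import math

def yield_interval(sample_size, relevant_in_sample, z=1.96):
    """Point estimate and approximate 95% interval for the yield."""
    p = relevant_in_sample / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical: 8 relevant documents found in a random sample of 400.
estimate, low, high = yield_interval(400, 8)

# Projected onto a hypothetical corpus of one million documents.
corpus_size = 1_000_000
low_count = round(low * corpus_size)
high_count = round(high * corpus_size)
```

In this example the estimated yield is 2%, but the interval runs from roughly 0.6% to 3.4%, so the projected number of relevant documents spans roughly 6,000 to 34,000. Any recall figure computed against that denominator inherits the same uncertainty.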

But why should predictive coding stop when content-based analysis has been exhausted?  And why should statistical methods be the exclusive benchmark?  Investigators, auditors and compliance professionals are not so restrained.  In investigations it is important to gather information from data, and if possible to use what is gathered to develop greater insight and identify additional data.  One must, as Assistant Attorney General Henry Petersen said so many years ago, "follow the money".

Big data has spawned big analytics to do this very thing.  Advanced predictive analytics has been applied in ediscovery; it's commonly known as predictive coding, technology assisted review, and on and on...   Tunnel vision has limited predictive analytics in ediscovery a bit, however.   There seems to be little or no effort to extend analytics beyond the first round of analysis, the content round.  There is no attempt to unleash the metaphorical analytical bloodhounds.

That's a missed opportunity because content-based predictive coding itself generates patterns.  Some have suggested approaches to identify additional documents based on what has been found through content-based predictive coding, but they are oddly regressive, requiring that humans sift through the information in an ad hoc fashion, seeing only what humans can manage to see.  Not an unjustified approach, only limited.

After content-based predictive coding has created the data points of a multi-dimensional document excavation site (shown conceptually above), analytics has not finished its usefulness.  There is more to glean from extracted, implied and derived metadata.  More than what the eye can always spy.

Information identification will be enhanced by the application of advanced analytics beyond content-based predictive coding.  Parties in ediscovery will not only say "we achieved somewhere between 40 and 80% recall", they will be able to say in addition, "we further identified X number of additional important documents by predictively determining where they likely would be".

The evolution continues.