When predictive coding (defined as technology-enabled processes that substitute machine classifications en masse for manual coding decisions) was first introduced to ediscovery customers, vendors chose to present it Athena-like, springing fully formed and ready for customer use. Consequently, their pricing models were, let us say, steep; the industry offered the predictive coding feature at a cost that represented only a discount off the cost of the alternative manual process. It was a high-margin solution.
The trouble with this approach was, and is, that predictive coding-enabled processes don't perform well on all document sets, and miss some classes of documents in every document set. Predictive coding came to be seen by end users as high-risk: a gamble suitable only for very large cases in which manual review is not feasible.
It was really only after the, to put it mildly, lackluster response of the customer community to predictive coding that vendors began advocating lower-risk applications such as quality control and post-production triage. Here's where the decision to set predictive coding prices so high came back to haunt vendors. No one wanted to pay the premium for predictive coding if they weren't using it to cut costs by replacing lawyers. It was viewed as a very expensive nice-to-have but not cost-justifiable gadget.
Instead of giving the tool to users with some suggested workflows, allowing users to experiment, and, most importantly, gaining valuable and free customer feedback, the industry doubled down on pure predictive coding, promoting its use, pushing test results and relying on endorsements. None of that seems to have worked well in gaining general customer acceptance.
Recently, it appears that the vendor industry has quietly begun acknowledging that predictive coding is not the pot of gold it once conceived it to be. Pricing has become more rational, in some cases downright reasonable, and this has freed clients to play with the technology and decide for themselves how to use it most beneficially.
This "democratization" of ediscovery analytics is promising, and may well likeley lead to greater innovative uses.
Tuesday, February 2, 2016
Tuesday, January 26, 2016
There is an important attribute of modern document collections that is mostly overlooked but that needs to be considered when relying on predictive coding to fulfill document discovery obligations: email term count.
The problem can be illustrated by an examination of emails from the Enron document set. In a population of approximately 150,000 Enron emails drawn from a number of custodians, with prepositions, pronouns, email disclaimers and header terms excluded, approximately 20% of the emails have 10 terms or fewer and 30% have fewer than 20 terms (and it should be noted that the exclusion of email header terms is not comprehensive; the process usually misses some noise terms). The existence of such small emails presents a nettlesome problem because the reliability of current statistically based machine learning classification algorithms necessarily falls off as email term counts decrease. This is not to say that machine learning techniques cannot identify small emails: they can, if the email happens to contain a (statistically) key term. However, the paucity of terms in small emails makes correct classification a much chancier process than when dealing with wordier documents. (This is one reason much of traditional machine-based information retrieval utilized "manicured," homogeneous, larger documents.)
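The kind of measurement described above can be sketched in a few lines of Python. This is a minimal sketch, not the analysis actually used on the Enron set: the noise-term list below is purely illustrative, and a real implementation would use a much fuller inventory of prepositions, pronouns, disclaimer boilerplate and header vocabulary.

```python
import re

# Illustrative noise terms only: prepositions, pronouns, articles,
# and common email-header vocabulary (a real list would be far longer).
NOISE_TERMS = {
    "to", "from", "cc", "bcc", "subject", "re", "fw", "sent",
    "i", "you", "he", "she", "it", "we", "they",
    "a", "an", "the",
    "of", "in", "on", "at", "by", "for", "with",
}

def substantive_term_count(email_body: str) -> int:
    """Count the terms remaining after noise terms are excluded."""
    terms = re.findall(r"[a-z']+", email_body.lower())
    return sum(1 for t in terms if t not in NOISE_TERMS)

def share_at_or_below(emails: list[str], threshold: int) -> float:
    """Fraction of emails whose substantive term count is <= threshold."""
    counts = [substantive_term_count(e) for e in emails]
    return sum(1 for c in counts if c <= threshold) / len(counts)
```

Run over a collection, `share_at_or_below(emails, 10)` and `share_at_or_below(emails, 20)` yield the kind of percentages quoted above.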
The cumulative contribution of emails of specific term count sizes to total email population can be seen in the chart below:
Of course, not all collections will exhibit this pattern, but if experience is any indicator, many will. After all, emails are very often just conversation “bits”.
The problem presented by terse emails is a vexing one because: 1) emails often comprise 80% or more of document collections; and 2) emails often contain important informational statements and admissions dispositive to litigation claims and defenses. The problem is exacerbated by the fact that vendors may not even attempt to remove noise terms from emails, misleadingly inflating the term counts of substantively small emails.
This is a problem significant enough that managed review workflows employing predictive coding classification in lieu of human review of all items will quite often skim off a number of the smallest documents in the review set for manual review as "exceptions," in recognition of the inability of machine learning systems to analyze them reliably. However, as shown in the chart above, the practice of shaving off the smallest 1,000 documents for manual review does not address the large portion of emails that are effectively opaque to machine classification techniques. For example, using the data in the chart above as a guide, a collection of 200,000 emails would contain about 40,000 emails with ten terms or fewer and about 60,000 with fewer than twenty terms. Clearly, therefore, the practice of segregating a certain, relatively small number of the smallest documents provides only an illusion of quality assurance; it does not qualify as a sound practice because it does not satisfactorily address the problem presented by emails that are too small to be reliably analyzed.
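The gap between a fixed-size skim and the actual volume of terse email can be made concrete with a back-of-the-envelope calculation. The function below is a hypothetical helper, using the Enron-derived share quoted above (roughly 20% of emails at ten terms or fewer):

```python
def skim_coverage(total_emails: int, share_below_threshold: float,
                  skim_size: int) -> float:
    """Fraction of the sub-threshold emails that a fixed-size
    manual-review skim actually captures."""
    below = total_emails * share_below_threshold  # emails too terse to classify reliably
    return min(skim_size / below, 1.0)

# A 1,000-document skim from a 200,000-email collection where
# ~20% of emails have ten terms or fewer:
coverage = skim_coverage(200_000, 0.20, 1_000)
print(f"{coverage:.1%}")  # prints "2.5%"
```

In other words, under these assumptions the skim reaches only 1,000 of roughly 40,000 problem emails; the remaining 97.5% are left to a classifier that cannot analyze them reliably.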
Again, this is important because experience indicates that if there is any correlation between document term count and importance, it is an inverse one, meaning the set of smaller documents, especially emails, often contains the more important information.
The most pernicious aspect of this problem is that statements containing information vital to the just resolution of a matter may be miscoded as irrelevant and irretrievably lost. Consider this for a moment: in some number of cases, the existence of a very small number of emails will require the oft-mentioned "come to Jesus meeting" and turn settlement talks on their head. The non-identification and non-production of these items, although supported by current proportionality assessments of the discovery effort, may well, in contravention of FRCP Rule 1, lead to a miscarriage of justice.
Parties being asked to go along with predominantly machine-coded workflows proposed by opponents should keep this in mind. Specifically, they should consider that if an opponent has decided that some emails are too small to be analyzed by a predictive coding solution, there are very likely many more emails that are similarly not amenable to predictive coding analysis but that have been coded out by its application anyway.
Finally, the analytics software industry has a role to play in providing solutions to this hole in the automated solution stack. That may not be a welcome message to vendors who no longer invest in analytics innovation, but it is a necessary one.