Thursday, July 24, 2014
The content-based statistical predictive coding performance assessments currently advocated are both squishy and uncertain. Sampling to determine yield leads to wide margins of error. At levels below full recall (and how do you ever know you have that?) recall tells you nothing about the information gain in the production, or the relative diversity of produced documents. Stakeholders are left to argue proportionality based on very fuzzy, incomplete information.
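To see why sampling-based recall estimates carry wide margins, consider the confidence interval around a recall figure measured on a modest validation sample. This is a minimal, self-contained sketch; the sample counts are invented for illustration, not drawn from any matter:

```python
import math

def wilson_interval(hits, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = hits / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return (center - half, center + half)

# Suppose 60 of 100 sampled relevant documents were retrieved:
lo, hi = wilson_interval(60, 100)
print(f"point estimate 60% recall, 95% CI {lo:.0%}-{hi:.0%}")
```

Even with 100 judged relevant documents, the interval spans roughly twenty points, which is exactly the kind of fuzziness stakeholders end up arguing proportionality over.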
But why should predictive coding stop when content-based analysis has been exhausted? And why should statistical methods be the exclusive benchmark? Investigators, auditors and compliance professionals are not so restrained. In investigations it is important to gather information from data and, where possible, to use what is gathered to develop greater insight and identify additional data. One must, as Assistant Attorney General Henry Peterson said so many years ago, "follow the money".
Big data has spawned big analytics to do this very thing. Advanced predictive analytics has been applied in ediscovery; it's commonly known as predictive coding, technology assisted review, and on and on... Tunnel vision, however, has somewhat limited predictive analytics in ediscovery. There seems to be little or no effort to extend analytics beyond the first round of analysis, the content round. There is no attempt to unleash the metaphorical analytical bloodhounds.
That's a missed opportunity because content-based predictive coding itself generates patterns. Some have suggested approaches to identify additional documents based on what has been found through content-based predictive coding, but they are oddly regressive, requiring that humans sift through the information in an ad hoc fashion, seeing only what humans can manage to see. Not an unjustified approach, only limited.
After content-based predictive coding has created the data points of a multi-dimensional document excavation site (shown conceptually above), analytics has not exhausted its usefulness. There is more to glean from extracted, implied and derived metadata. More than what the eye can always spy.
Information identification will be enhanced by the application of advanced analytics beyond content-based predictive coding. Parties in ediscovery will not only say "we achieved somewhere between 40 and 80% recall"; they will also be able to say, "we further identified X additional important documents by predictively determining where they likely would be".
The evolution continues.
Monday, July 21, 2014
Topiary discussions from Summer 2013 on the use of diverse content samples and the Zipf distribution of similar documents
Imitation, they say, is the sincerest form of flattery...
Monday, July 14, 2014
The IBM cyber security team has posted an interesting concise cyber defense model that implicates information governance and text analysis.
The model calls for real-time analysis of cyber risks. A central requirement of the response model is the ability to know, in real time, where "high value assets" are and what they consist of, in order to predictively gauge the risk of possible advanced, coordinated attacks. Because such assets may take the form of unstructured information, and because their existence, content and locations often change over time, information governance analytics can help by providing up-to-date information on high-value assets within the enterprise.
The page can be found here.
Monday, July 7, 2014
Part 3 of Series in ACEDS Critiquing Current Assumptions Underlying Predictive Coding Protocols
Friday, July 4, 2014
Tuesday, June 3, 2014
Below is a follow-up to an older post with some additional, more concrete suggestions for technology assisted review protocols in the ediscovery response (as opposed to investigative) context.
- If the use of low resource document culling prior to TAR use is being considered, avoid arguments about the diminution of relevant material by employing methods more nuanced than keyword application. These methods are not as cheap as the keyword scythe but they need not be as expensive as current TAR solutions and can be much more recall-protective than keywords.
- Oppose disclosing documents used for seed and/or training purposes. There are legitimate reasons for being concerned about such disclosures. Moreover, providing requesting parties with the documents used in the training process really doesn’t give them the information that they need anyway because they have nothing to compare to that set except their hunches, and perhaps documents that they have obtained. So, the disclosures are not reliably an act in furtherance of transparency or cooperation. They do however reliably raise risk. Therefore, instead of disclosures, offer to include a defined number of training documents from the requesting party.
- Estimate baseline prevalence using a content diversity maximized random sample (this is a sample that takes into account overlapping content and optimizes for a set of documents with differing content).
- Measure recall and precision at the end of the process using the same technique.
- Conduct content-based analytical protocols to identify relevant documents (the predictive coding or TAR protocol process). However, have reviewers additionally code documents that meet a defined level of relevance (e.g. coded “Important”) – this generally should not be limited to “smoking gun” documents. The definition should be simple. Note that variations in codes here do not impact the content-based coding process.
- If recall is not exceptionally high, utilize a risk reduction process that attempts to mitigate the likelihood of “lost” important documents. Because the set of these documents is almost universally tiny, statistical sample selection is not the appropriate method to reduce risk. Fortunately, the information available about “Important” documents can be leveraged to apply additional analytics to the information about documents rather than exclusively the information in documents. Analytics not far off conceptually from predictive coding algorithms can identify non-trivial, non-obvious patterns that focus on the document clusters most likely to contain additional information. Whether this process identifies additional documents or not, if performed reasonably it provides a reasonable analytics-based method to defuse objections to TAR usage based on the concern about non-production of important discovery documentation.
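The sampling and measurement steps above can be sketched in a few lines. This is an illustrative simplification: the min-hash fingerprint stands in for whatever near-duplicate grouping a review platform actually provides, and all names are invented for the example:

```python
import hashlib
import random

def fingerprint(text, k=5):
    """Coarse content fingerprint: min-hash over word k-grams (a
    stand-in for a real near-duplicate grouping facility)."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}
    return min(hashlib.md5(s.encode()).hexdigest() for s in shingles)

def diversity_sample(docs, n, seed=42):
    """Random sample drawing at most one document per content group, so
    overlapping content does not dominate the prevalence estimate."""
    groups = {}
    for doc in docs:
        groups.setdefault(fingerprint(doc), []).append(doc)
    rng = random.Random(seed)
    reps = [rng.choice(members) for members in groups.values()]
    return rng.sample(reps, min(n, len(reps)))

def recall_precision(retrieved_ids, relevant_ids):
    """End-of-process measurement over a coded validation sample,
    given sets of document ids."""
    tp = len(retrieved_ids & relevant_ids)
    recall = tp / len(relevant_ids) if relevant_ids else 0.0
    precision = tp / len(retrieved_ids) if retrieved_ids else 0.0
    return recall, precision
```

Prevalence is then estimated as the share of the diversity-maximized sample coded relevant; for example, `recall_precision({1, 2, 3, 4}, {2, 3, 4, 5, 6})` returns `(0.6, 0.75)`, i.e. three of five relevant documents found and three of four retrieved documents relevant.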
Some thoughts for increasing the successful usage of predictive coding methodologies:
Emphasize cost savings and qualify (or drop) the comparative performance claims - Quantify the relative costs of standard human review over a spectrum of corpora sizes. Acknowledge the legitimacy of attorney beliefs that in the general case, when measured appropriately, humans accomplish better discovery.
Assure that within the framework of proportionality the trade-off between reduced cost and reduced performance can be defensible.
Assuage concerns over the risk presented by diminished ediscovery performance by drilling deeper into performance assessments than recall and precision measures. Turn the conversation from document count estimation to assessments of the materiality of information produced in discovery.
Recommend systematic as well as ad hoc risk-reduction methodologies that demonstrate defensibility within the proportionality framework by either identifying additional material information or establishing a reduced probability that material relevant information was not produced. (This will likely require improved processes.)
Finally, offer (at no cost) to structure an internal test using previously coded data, to perform the predictive coding protocol using the information-based processes, and to obtain a more robust assessment of the quality of identified information vs. non-identified material information.
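One deliberately simplified form such a systematic risk-reduction method could take is to score metadata clusters (custodian, folder, time period) by their density of documents already coded "Important" and review the highest-scoring clusters first. The cluster keys, the smoothing choice, and all names below are assumptions for illustration only:

```python
from collections import Counter

def rank_clusters(docs, important_ids):
    """Rank metadata clusters by their share of documents already coded
    'Important'. 'docs' maps doc id -> (custodian, folder, month); the
    top-ranked clusters are where additional important documents are
    most likely to be found."""
    totals, hits = Counter(), Counter()
    for doc_id, cluster in docs.items():
        totals[cluster] += 1
        if doc_id in important_ids:
            hits[cluster] += 1
    # Laplace-smoothed hit rate so tiny clusters don't dominate
    scores = {c: (hits[c] + 1) / (totals[c] + 2) for c in totals}
    return sorted(scores, key=scores.get, reverse=True)
```

Note that this analyzes information about documents (their metadata) rather than the information in them, which is the shift the risk-reduction argument above turns on.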
Saturday, April 26, 2014
Harvard Business Review article, Consulting on the Cusp of Disruption, on McKinsey's evolving strategy has some insights for law firms. In many areas where firms now supply services, often through vendors, the services increasingly come from the gray area between law and analytical solutions, e.g. ediscovery, technology assisted review, information governance, cyber security. If firms want the much sought after "relationship" with clients, then why not own software that the client uses routinely?