I had the opportunity last week to have a great conversation with Sandy Serkes, CEO of Valora Technologies. I was pleasantly surprised to learn that Valora has a track record of working with predictive modeling applications for document sets – a company to watch as review processes move toward machines doing more of the heavy lifting.
We talked a bit about the need for competence in sampling methodology design. It’s my belief that any well-reasoned judicial decision assessing the acceptability of auto-tagging will have to articulate some minimum standards for sampling design and process before the industry can rely on it to move toward greater machine-assisted review. But what will those standards be?
Let me start by saying that I don’t pretend to be an expert in formal statistics or sampling methodology. I have training in basic statistics and probability, as well as survey methodology. I also have experience creating sampling methodologies for compliance and monitoring programs. I’ve spent a good deal of time exploring the characteristics of document sets and their effect on investigations and discovery.
With that said, let me start with some observations. Sampling design requires an understanding of what outcome or outcomes the sample will measure. This is required to assure that the sample will serve as an accurate proxy for the entire population. In the case of e-discovery, this translates to issue determination and construction, and to sampling design.
Issue construction is commonly based on subject matter expert determinations alone, or in conjunction with the structure of document requests. A document’s relevance to an issue is defined by one or more of the following characteristics: linguistic and non-linguistic terms; concepts; intonation; parties related to the document; and date ranges. Theoretically, each issue should be somewhat distinct, identifiable by its content/metadata signatures. In reality, however, issues are fluid and idiosyncratic. Document terms and concepts often create overlap among multiple issues – hence the familiar sight of documents bearing multiple issue tags.
Beyond issue identification itself, there is the phenomenon of relevance deconstruction within each issue. Anyone who has worked on document sets knows that even for individual issues there can be multiple ways in which document contents create relevance. A simple anecdote: In a defective products case, documents that evince knowledge of the defect are relevant. Technical people within the target organization document their realization of the defect in highly technical, abstruse terms. These documents are relevant. Separately, other departments becoming aware of the problem discuss the matter in entirely different ways from the engineers and R&D people – marketing people one way, compliance people another, and C-level personnel yet another – all based upon their job function and unit culture. All of the resulting documents are relevant to the issue of knowledge. Clearly, no simple sample taken from the whole population will be adequate to assure representation from all of these distinctly relevant sub-populations (e.g., R&D, marketing, compliance). And without representation, they will likely not be identified as relevant by predictive models. (Despite some claims to the contrary, predictive models are not magical.)
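To make the sub-population point concrete, here is a minimal sketch – my own illustration, not any vendor’s method – of stratified selection, in which each department-level stratum is sampled separately so that small but distinctly relevant groups (compliance, C-suite) are guaranteed representation. The `stratum_of` mapping is hypothetical; in practice strata would come from custodian metadata or SME-defined groupings.

```python
import random
from collections import defaultdict

def stratified_sample(docs, stratum_of, per_stratum=50, seed=0):
    """Draw up to `per_stratum` documents from each sub-population.

    `stratum_of` maps a document to its stratum label (e.g. the
    custodian's department: R&D, marketing, compliance, executive).
    Unlike a simple random sample over the whole population, every
    stratum is represented, however small it is.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[stratum_of(doc)].append(doc)
    sample = []
    for label, members in strata.items():
        k = min(per_stratum, len(members))  # small strata contribute everything they have
        sample.extend(rng.sample(members, k))
    return sample
```

Under a simple random draw of the same total size, a stratum that is 1% of the population would contribute roughly 1% of the sample – often zero documents – which is exactly the failure mode described above.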
These issue/population attributes have an obvious effect on population and sub-population determination, which in turn drives sample selection decisions. I do not believe that, if challenged, simple random sample generation – say, ~500 documents drawn from the entire population to claim a 95% confidence level with a specified margin of error – will generally be judged acceptable given these realities. Defensible sampling will require interplay between subject matter experts and people competent to construct sampling techniques that reasonably assure that samples adequately test for specified issues, and that test results can reasonably be inferred to represent the whole population of documents. This will invariably require techniques more elaborate than plugging population numbers into a random sample calculator to determine sample size.
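For context, the “random sample calculator” referred to above is just the standard sample-size formula for estimating a proportion, with a finite population correction. A sketch follows (the parameter defaults – 95% confidence, 5% margin of error, worst-case proportion of 0.5 – are the conventional calculator settings, which is where figures in the ~400–500 document range come from):

```python
import math

def sample_size(population, z=1.96, margin_of_error=0.05, p=0.5):
    """Sample size for estimating a proportion via simple random sampling.

    z = 1.96 corresponds to 95% confidence; p = 0.5 is the
    worst-case (most conservative) assumed proportion.
    Note what this calculation does NOT account for: strata,
    sub-populations, or issue structure of any kind.
    """
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    n = n0 / (1 + (n0 - 1) / population)  # finite population correction
    return math.ceil(n)
```

For a million-document population this yields roughly 385 documents – which is precisely the point: the number is driven entirely by the confidence parameters, not by anything about the issues or sub-populations the sample is supposed to test.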
The framework for assessing a party’s reasonable diligence in determining issue definition, sub-population treatment, and sampling technique (e.g., simple random, stratified) should be a central discussion in upcoming judicial opinions. This will give stakeholders the ability to construct processes that improve the quality of e-discovery output, dramatically reduce costs and project timelines, and protect privilege, while ensuring that parties have not incurred additional risk in their adoption.