Topiary Discovery LLC

Powered by: Topiary Discovery LLC ... The Home of Predictive Pruning®



Sunday, November 15, 2015

eDiscovery Exit? Cyber Security Company Proofpoint acquires Predictive Coding Solution Provider OrcaTec. To What End?

On October 1st of this year, Topiary Discovery posted a piece entitled TAR for Cyber Security: Say Hello to “TARM” (Technology Assisted Risk Management), discussing research work being done at Oak Ridge National Laboratory and coining the term TARM©.

This week, OrcaTec, the text classification and predictive coding solutions company, announced on its website that it has joined Proofpoint, an enterprise threat protection, information compliance and discovery solutions company that has been actively acquiring businesses to broaden its offerings.

To be sure, the simplest round hole into which Proofpoint can place the OrcaTec widget is its discovery stack.  However, given Proofpoint’s stated objective of preventing sensitive data exfiltration, it will be interesting to see how creative the company will be in operationalizing the OrcaTec stack.

Monday, November 2, 2015

An Accelerated Review Workflow for Occasions When Full Blown Predictive Coding is not in the Cards: Sequential Thread Review; Multi-Tiered Near Duplicate Batching; and Concept Grouping

By now it should be clear to observers of the ediscovery industry that predictive coding (defined as those technology assisted review processes that are designed specifically to reduce review costs by replacing significant human review with automatic coding generated by machine learning models) has not replaced human document reviews wholesale.  The reasons for this failure are manifold.  Contrary to early claims by promoters, most objective practitioners now acknowledge that machine-based reviews are generally not as robust as competent human reviews.  Moreover, machine techniques don’t work for image documents, and don’t work reliably on small and very large documents.  Practitioners have also found that it is not always clear at the outset of a case whether predictive coding is the optimal choice given the quirks of a given collection and associated issues.  Cost uncertainties can be an issue.  Consequently, many attorneys still prefer manual reviews where they are feasible, i.e. when document sets are not prohibitively large.

Of course, manual review workflows have evolved.  Vendors and law firms have developed a variety of techniques to increase review efficiency while maintaining the quality inherent in human review.  This article discusses one type of workflow that increases efficiency substantially and allows for human review for larger collections than can be accomplished by simple manual review, thus removing a larger set of ediscovery matters from the challenging consideration of predictive coding.  It also provides an optional method of incorporating predictive coding in a tactically surgical manner.

A variation of the workflow is discussed at the end of this article: it combines the accelerated review workflow with the targeted use of predictive coding on the subset of documents upon which predictive coding has been shown to work most reliably. Used tactically in this manner, predictive coding can further increase review efficiency and extend the range of document set sizes that accelerated review can accommodate.

Accelerated Review Workflow

1.      The workflow assumes that the initial document collection has been culled by date ranges (at least for items such as email and social media that have reliable date metadata) and by keywords.  However, the workflow does not intrinsically require that these steps precede its use.
2.      The workflow assumes that traditional document deduplication has been performed as part of the ESI processing function.  Deduplication can be carried out automatically to remove all but one instance of exact duplicates, as determined by a hash-value comparison.  Such hash-value comparisons are considered “brittle” – either two or more documents are exactly the same down to the bit level or they are not.   Deduplication can be carried out “within custodian,” meaning only duplicates that occur within a custodian’s mailbox will be removed, or “across custodians,” meaning all duplicates encountered within the entire collection will be removed (a minimal sketch of this logic appears after this list).  Again, however, this is not a necessary precondition to the workflow’s use.

  • This workflow assumes that all documents that can be OCRed have been, and that pure image files and other non-text files are handled as exceptional cases.  Again, this is not a precondition.
  • The workflow focuses a good deal on email set manipulation because emails and associated attachments often represent 80% of collection content. In addition, email collections contain unique and reliable metadata that can be used to enhance review acceleration.
  • Finally, this workflow assumes that the user has access to standard vendor analytics tools including thread detection and near duplicate identification and clustering.  
  • An important note: the workflow also anticipates that users possess some level of programming resources, because no vendor provides out of the box the technology required to accomplish all of the outlined steps.  Most vendors, however, can adapt their solutions to accommodate the workflow, albeit at a cost.
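
As noted in the deduplication assumption above, exact-duplicate removal rests on a hash-value comparison.  A minimal Python sketch of that logic follows, covering both the “within custodian” and “across custodians” options; the document fields (path, custodian) are hypothetical placeholders, and processing platforms perform this step internally.

```python
import hashlib
from typing import Dict, Iterable, List

def file_hash(path: str) -> str:
    """Return the SHA-256 digest of a file's bytes; identical digests mean exact duplicates."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def deduplicate(docs: Iterable[Dict], across_custodians: bool = True) -> List[Dict]:
    """Keep one instance per hash.  With across_custodians=False, duplicates are
    removed only within each custodian's set (the "within custodian" option)."""
    seen = set()
    kept = []
    for doc in docs:
        digest = file_hash(doc["path"])
        key = digest if across_custodians else (doc["custodian"], digest)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```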

Workflow Steps
The workflow comprises five major components:

Figure 1: Workflow Sequence of Steps

1.            Thread-based automated email elimination
Emails possess a unique thread identifier that can be used to group emails that form part of the same conversation.  Using a vendor’s thread detection tool, identify the most inclusive email for each thread group.  This email will contain the prior emails in the conversation.  So in the general case where, for example, the read/unread status of an email in a specific custodian’s mailbox is not important, it may well suffice to exclude these component emails and have the vendor tag them consistently with the tags assigned to the most inclusive email.
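
As an illustration only, a minimal sketch of the selection logic follows.  It assumes the vendor’s thread detection has already stamped each message with a thread_id and approximates the most inclusive item as the longest message in its thread; the field names are hypothetical, and real thread tools use reply-chain analysis rather than raw length.

```python
from collections import defaultdict

def select_most_inclusive(emails):
    """Split emails into (review_set, suppressed_set) by thread.
    Each email dict is assumed to carry 'thread_id' and 'term_count' fields."""
    threads = defaultdict(list)
    for msg in emails:
        threads[msg["thread_id"]].append(msg)

    review, suppressed = [], []
    for members in threads.values():
        # Approximate the most inclusive email as the longest in the thread.
        inclusive = max(members, key=lambda m: m["term_count"])
        review.append(inclusive)
        suppressed.extend(m for m in members if m is not inclusive)
    return review, suppressed
```

Suppressed component emails would then inherit the coding applied to their thread’s most inclusive email, as described above.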

The use of email thread reduction will still leave a very large set of documents.  The next step in the process involves multi-tiered near-duplicate cluster review.

2.            Multi-tiered Near Duplicate Batching
Near-duplicate clustering is a very useful tool for aggregating documents to be batched out to reviewers.  However, there is a rub that makes near-duplicate clustering difficult to use defensibly and reliably in practice: small emails.   Near-duplicate cluster reliability is inversely related to document term count, and below certain term counts items that should not be grouped together end up grouped anyway. Because an email’s size is not proportional to its potential importance, grouping disparate emails can cause important items to be miscoded based upon an examination of an irrelevant first email in the group.  This problem is exacerbated by the titles, headers, footers or disclaimers often included in organizational email communications.  Some workflows simply opt not to consider items below a term-count threshold.  This preserves the validity of the near-duplicate groups, but at a great cost: in many document collections, a large plurality of items may consist of emails that fall below the near-duplicate cutoff point, which reduces the efficiency of the review.

This workflow addresses that problem.  It retains a term-count threshold, but it also forms groups using other family documents, as well as email subject lines, to regain some of the lost efficiency.  With this in mind, to optimize efficiency while retaining reliability, the near-duplicate clustering workflow proceeds in sequential steps (a routing sketch follows the list):
  1. Group near-duplicate emails and families together for all emails that exceed a minimum term count and present to reviewers by near-duplicate group
  2. For emails with term counts below the threshold that have larger attachments, group the emails and families together using the attachment and present to reviewers by near-duplicate group
  3. For small emails that do not have any attachments but that have content in the subject line, place them into one group, disregard reply and forward prefixes, and sort by subject line content
  4. Group remaining small emails and documents into one group and sort by content.
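
The routing logic behind these four tiers might be sketched as follows.  This is illustrative only: the threshold values and field names (term_count, attachment_term_counts, subject) are hypothetical placeholders to be tuned against the actual collection.

```python
MIN_TERMS = 50              # hypothetical near-duplicate reliability threshold
MIN_ATTACHMENT_TERMS = 50   # hypothetical cutoff for a "larger" attachment

def assign_tier(email):
    """Return the review tier (1-4) for an email, per the sequence above."""
    if email["term_count"] >= MIN_TERMS:
        return 1  # tier 1: normal near-duplicate grouping on the email body
    attachments = email.get("attachment_term_counts", [])
    if any(t >= MIN_ATTACHMENT_TERMS for t in attachments):
        return 2  # tier 2: group the family by its larger attachment
    if email.get("subject", "").strip():
        return 3  # tier 3: one batch, sorted by normalized subject line
    return 4      # tier 4: remaining small items, one batch sorted by content

def normalized_subject(subject):
    """Strip reply/forward prefixes so 'RE: RE: Budget' sorts alongside 'Budget'."""
    s = subject.strip()
    while s.lower().startswith(("re:", "fw:", "fwd:")):
        s = s.split(":", 1)[1].strip()
    return s.lower()
```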

3.            Concept Grouping of Near-Duplicate Cluster Batches
Reviewers can increase review rates if consistent material is presented together.  To accomplish this, near-duplicate groups can themselves be grouped into larger, looser collections by measuring the similarity between near-duplicate groups using a concept, topic or other similarity measure.
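
A minimal sketch of that grouping step follows, assuming scikit-learn is available and that each near-duplicate cluster is represented by the concatenated text of its members; the greedy merge and the similarity threshold are illustrative choices, not recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def group_clusters(cluster_texts, threshold=0.3):
    """Greedily merge near-duplicate clusters whose TF-IDF vectors are similar,
    so reviewers see conceptually consistent material back to back."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(cluster_texts)
    sims = cosine_similarity(vectors)

    groups, assigned = [], set()
    for i in range(len(cluster_texts)):
        if i in assigned:
            continue
        group = [i]
        assigned.add(i)
        for j in range(i + 1, len(cluster_texts)):
            if j not in assigned and sims[i, j] >= threshold:
                group.append(j)
                assigned.add(j)
        groups.append(group)
    return groups
```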

Tactical Predictive Coding
Of course, at a certain collection size accelerated review processes alone will not suffice – predictive coding techniques are necessary to create a feasible review workflow. Even here, the accelerated review process described above can be used with predictive coding.  In this case, predictive coding can be introduced conservatively by restricting its use to documents within a specified term range – for example, documents with term totals between 100 and 5,000 terms (generally a plurality of the document collection).  This provides for predictive coding of documents within the specified range while excluding the documents most difficult for machine learning to handle reliably: short documents and very long documents.   Those can be managed using the accelerated workflow described above.

Users with the necessary programming resources can consider developing simple machine learning semantic analysis techniques to apply to the text versions of this restricted document set.  Alternatively, users can negotiate with vendors to use (and pay for) predictive coding only on the restricted set. In either case, users should pre-determine the relative sizes of the document set that will be included in the accelerated review workflow and the document set upon which predictive coding will be performed.
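
For users with those programming resources, the term-range restriction and a basic relevance model might look like the sketch below.  The 100-5,000 term band comes from the discussion above; the field names, TF-IDF features and logistic regression model are illustrative assumptions rather than a prescribed approach.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def split_by_term_count(docs, low=100, high=5000):
    """Route documents into the predictive coding set or the accelerated review set."""
    pc_set = [d for d in docs if low <= d["term_count"] <= high]
    accelerated_set = [d for d in docs if not (low <= d["term_count"] <= high)]
    return pc_set, accelerated_set

def train_relevance_model(seed_texts, seed_labels):
    """Fit a simple relevance classifier on reviewer-coded seed documents."""
    model = make_pipeline(
        TfidfVectorizer(stop_words="english", max_features=50000),
        LogisticRegression(max_iter=1000),
    )
    model.fit(seed_texts, seed_labels)
    return model

# Usage sketch: score the remaining in-range documents and prioritize review.
# scores = model.predict_proba([d["text"] for d in pc_set])[:, 1]
```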

Friday, October 30, 2015

Reasons Why So Many eDiscovery Analytics Initiatives Have Collapsed and A Brief Outline On What To Do To Succeed

Strategic Talent Gap - service providers don't invest in talent that can objectively and realistically integrate real client needs with real analytics capabilities.  This is often because they have invested only in pure data scientists on one side to develop solutions and (truth be told) entirely nontechnical sales, marketing and consulting professionals on the other to sell them.  This is prevalent but obsolete thinking.  Service providers who wish to introduce successful analytics-centric solution stacks into the market must more aggressively identify and retain people whose expertise genuinely bridges the analytics, legal and marketing realms.

Functional Fixedness - service providers in ediscovery have historically been successful selling basic ediscovery service components that match the corresponding tasks described by the traditional EDRM model.  They very often have a blind spot when it comes to recognizing how analytics can be applied to broader but related client challenges.  Many in the industry have been led by data scientists applying data analytics in a vertical with which they are unfamiliar. This has resulted both in efforts to plug analytical solutions into traditional ediscovery task troughs where, from the clients' perspective, they often don't bring real-world value, and in the promotion of frankly inadequate analytics-based solutions to ediscovery needs.  It has also resulted in a missed opportunity.  There are a host of techniques from other industries that have not been introduced into ediscovery.  Similarly, there are suitable applications for current solution stacks in areas related to ediscovery that many vendors are not appropriately exploring.   Vendors must invest in personnel who can weave analytics from other fields into their solution stacks and envision applying those stacks laterally to additional client challenges.

Periscope Myopia - years ago, first-generation ediscovery analytics solution providers flooded the market with Petri dish and periscope visualizations that appealed to tech geeks but not at all to non-data practitioners.  Surprisingly, this problem is still widespread as vendors roll new analytics gadgets into production.   eDiscovery solution development must be tied very closely to client objectives as well as client feedback.  The experts described above play a central role both in identifying opportunities not yet recognized even by clients and in assuring that solution development leads to products clients will actually want to use.

Investment Constraints - many companies bet big on pure predictive coding quickly becoming the de facto replacement for most human review, and that hasn't happened.  They expected to cleave a lucrative section from the large piece of ediscovery cost that human review represents.  For the most part, this transferred wealth hasn't been realized, and it has left many ediscovery companies, from solution providers to service providers, in the lurch.  Consequently, there is currently little investment in bold new solutions.  This is unfortunate, because the successful application of robust analytics-based solutions isn't far from becoming a practical reality. Certainly some new, innovative approaches are needed to provide solutions that appeal both to clients and to their attorneys, and that conform to their typical business models.  This of course requires additional investment, but the opportunity is there, and it is close.

Thursday, October 1, 2015

TAR for Cyber Security: Say Hello to “TARM” (Technology Assisted Risk Management)

Nestled just west of the Great Smoky Mountains in Tennessee is the Department of Energy’s Oak Ridge National Laboratory.  Within that sprawling lab facility is the Oak Ridge Cyber Analytics (ORCA) Team.  The ORCA development philosophy as stated on its website is as follows:

1.      Focus on challenging problems in the cyber security domain for which there are gaps in available technology.
2.      The speed, volume, and complexity of cyber security data has outstripped our ability to defend systems with manually intensive processes.
ORCA is designed to provide an adaptive, accurate, and reliable analytic infrastructure for cyber security.
The ORCA tool kit contains several components.  Of particular importance to those who are involved in analytical applications in the discovery realm is the Network Data Discovery Engine (NDDE).  The stated objective of the tool is one that should sound somewhat familiar to TAR practitioners:

Mapping the distribution of textual data on a network, including quantifying the value of the information each host contains.
The conceptual similarity is obvious.  The NDDE scores documents based upon an automated or semi-automated supervised learning classification system:

The ORCA Asset Valuator is highly configurable.  Operators have the flexibility to customize the information categories used for scoring.  These information categories are lists of terms and phrases that characterize a document.  For example, an information category called “Anatomy” might be characterized by terms such as “body”, “structure”, or “morphology”.  Additionally, the “Anatomy” category might include terms for all of the human body parts and organs.  An information category is simply a collection of terms and phrases that have a common theme.  In the Asset Valuator, the information categories provide the basis for discriminating the kind of information on each computer and the way in which each computer’s value is quantified.  The way in which the information value is quantified, or scored, is also configurable.  Operators have the option to select from different scoring approaches, each of which provides a different focus.
The difference between this cyber-centric classification system and that of ediscovery is that here the metric in question is not relevance but organizational risk.  Classes and scores would be used to inform information management resources as to where the documents reside that, if exfiltrated, would damage the organization in some significant way (e.g. violating privacy laws, creating reputational damage, or disclosing trade secrets).   Data mapping and the scoring of information repositories would provide information to improve both proactive intrusion prevention efforts and post-breach damage analysis.  It would be apropos to coin an ediscovery-like acronym to describe this cyber security predictive text analytics approach. I propose TARM: Technology Assisted Risk Management.
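
To make the parallel concrete, a toy version of category-based risk scoring might look like the sketch below.  The categories, terms and weights are invented for illustration and are not ORCA's actual configuration.

```python
import re

# Hypothetical risk categories: each is a themed list of terms and phrases,
# mirroring the "information category" idea in the Asset Valuator description.
RISK_CATEGORIES = {
    "pii": ["social security", "date of birth", "passport number"],
    "trade_secret": ["proprietary", "formula", "source code"],
    "reputational": ["settlement", "harassment", "investigation"],
}

def risk_score(text, weights=None):
    """Score a document (or host) by counting weighted category term hits."""
    weights = weights or {name: 1.0 for name in RISK_CATEGORIES}
    lowered = text.lower()
    score = 0.0
    for category, terms in RISK_CATEGORIES.items():
        hits = sum(len(re.findall(re.escape(term), lowered)) for term in terms)
        score += weights[category] * hits
    return score
```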

The approaches, technologies and experience gained in the ediscovery arena could greatly benefit cyber efforts of this type.  The capabilities and limitations of TAR techniques in classifying documents have had several years of real-world testing, and that experience translates directly to this kind of risk-oriented classification.