Nestled just west of the Great Smoky Mountains in Tennessee is the Department of Energy’s Oak Ridge National Laboratory. Within that sprawling lab facility is the Oak Ridge Cyber Analytics (ORCA) Team. The ORCA development philosophy, as stated on its website, is as follows:
1. Focus on challenging problems in the cyber security domain for which there are gaps in available technology.
2. The speed, volume, and complexity of cyber security data has outstripped our ability to defend systems with manually intensive processes.
ORCA is designed to provide an adaptive, accurate, and reliable analytic infrastructure for cyber security.
The ORCA tool kit contains several components. Of particular importance to those involved in analytical applications in the discovery realm is the Network Data Discovery Engine (NDDE). The stated objective of the tool should sound somewhat familiar to TAR practitioners:
Mapping the distribution of textual data on a network, including quantifying the value of the information each host contains.
The conceptual similarity is obvious. The NDDE scores documents based upon an automated or semi-automated supervised learning classification system:
The ORCA Asset Valuator is highly configurable. Operators have the flexibility to customize the information categories used for scoring. These information categories are lists of terms and phrases that characterize a document. For example, an information category called “Anatomy” might be characterized by terms such as “body”, “structure”, or “morphology”. Additionally, the “Anatomy” category might include terms for all of the human body parts and organs. An information category is simply a collection of terms and phrases that have a common theme. In the Asset Valuator, the information categories provide the basis for discriminating the kind of information on each computer and the way in which each computer’s value is quantified. The way in which the information value is quantified, or scored, is also configurable. Operators have the option to select from different scoring approaches, each of which provides a different focus.
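To make the idea of term-list categories concrete, here is a minimal sketch in Python. It is not ORCA's implementation. The category contents (beyond the three "Anatomy" terms quoted above), the function names `score_document` and `score_host`, and the simple match-count scoring are all assumptions made for illustration. The description above says only that the scoring approach is configurable, not how any particular approach works.

```python
import re
from collections import Counter

# Hypothetical information categories: each is a list of terms and phrases.
# Only the three "Anatomy" terms come from the description above.
CATEGORIES = {
    "Anatomy": ["body", "structure", "morphology"],
    "Privacy": ["social security number", "date of birth"],
}


def score_document(text, categories=CATEGORIES):
    """Count case-insensitive whole-word or whole-phrase matches per category."""
    scores = {}
    for name, terms in categories.items():
        total = 0
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            total += len(re.findall(pattern, text, flags=re.IGNORECASE))
        scores[name] = total
    return scores


def score_host(documents, categories=CATEGORIES):
    """Sum the per-category document scores for all documents on one host."""
    totals = Counter()
    for text in documents:
        totals.update(score_document(text, categories))
    return dict(totals)


if __name__ == "__main__":
    host_documents = [
        "The body has a structure.",
        "Body morphology and date of birth.",
    ]
    print(score_host(host_documents))
```

A raw match count is only one possible scoring choice. Weighting categories by sensitivity, or normalizing by document length, would shift the focus in the way the configurable scoring approaches are described as doing.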
The difference between this cyber-centric classification system and that of ediscovery is that here the metric in question is not relevance but organizational risk. Classes and scores would be used to inform information management resources as to where those documents reside that, if exfiltrated, would damage the organization in some significant way (e.g., violating privacy laws, creating reputational damage, or disclosing trade secrets). Mapping data and scoring information repositories would improve both proactive intrusion prevention efforts and post-breach damage analysis. It would be apropos to reference an ediscovery-like acronym to describe this cyber security predictive text analytics approach. I propose TARM: Technology Assisted Risk Management.
The approaches, technologies, and experience gained in the ediscovery area could greatly benefit cyber efforts of this type. The capabilities and limitations of TAR techniques in classifying documents have had several years