Thinking Ahead… Achieving Defensible Disposition with Machine Learning-Based Categorization in the Cloud: Interview with Bill Tolson, VP of Marketing, Archive360
In today’s increasingly litigious business environment, Defensible Disposition is becoming an ever more important concept. Can we set the stage with a quick definition?
BT: Absolutely. Defensible Disposition is the process of disposing of unneeded and valueless information in a manner that documents the disposition process, showing that the deleted data was neither under regulatory retention requirements nor subject to current or anticipated eDiscovery. In short, it shows that regulatory and legal considerations were taken into account before any data was destroyed.
Simple concept. But, not so simple to put into effect?
BT: True. Defensible Disposition is a relatively simple concept that is challenging to implement in practice. It typically involves a large commitment of time and budget, which in turn requires executive, often C-level, approval. Because this is a very specialized area of data management, most companies that pursue a Defensible Disposition strategy start by hiring consultants (expensive), although some prefer to rely on current employees (risky). If you rely on current employees, this time-consuming initiative will take them away from their standard work, which not only affects their productivity but can have a domino effect on the productivity of others across the company. Moreover, relying solely on employees can lengthen the project, as they typically need to learn the relevant regulations, laws and methodologies before they even begin.
Unfortunately, as neither of these strategies is optimal, many companies decide to push the project to the back burner, “for now.”
Can technology save the day?
BT: Seven or so years ago, predictive coding began making a splash on the eDiscovery scene. Predictive coding uses machine learning algorithms to train computers to search large data sets and determine which data is responsive (or non-responsive) to an eDiscovery request. The computer is trained with examples of the kinds of data that could be relevant.
The predictive coding of several years ago used a “supervised” machine learning model. This is where a human provides the computer with examples of both relevant and non-relevant information and then runs a test cycle to determine what the program got right and wrong in the predictive coding process. The human reviews the computer’s results, providing feedback on what it got right and wrong. This training period can take 5, 10, 25, 50, or more training cycles.
In this way the computer “learns” what data is relevant to the eDiscovery order, so that searching, culling, and tagging huge collected data sets for potentially responsive (and/or privileged) information is sped up radically while accuracy also rises. Usually legal personnel (or data scientists) control the training process by specifying relevance criteria and performing the training cycles. Predictive coding (machine learning) can speed the process of discovery review and reduce the cost by 80 to 90%. Over the years, this kind of machine learning has built a proven track record in the courts and has become widely accepted.
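The review-and-correct cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any eDiscovery product works: the `ReviewTrainer` class, the `train` helper, the sample documents, and the simple perceptron-style word weighting are all hypothetical stand-ins chosen for brevity.

```python
def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return text.lower().split()


class ReviewTrainer:
    """Toy word-weight classifier corrected by reviewer feedback (hypothetical)."""

    def __init__(self):
        self.weights = {}  # word -> weight; positive leans "responsive"

    def predict(self, text):
        """Return True when the document is predicted responsive."""
        return sum(self.weights.get(w, 0.0) for w in tokenize(text)) > 0

    def training_cycle(self, reviewed):
        """Run one cycle over (text, is_responsive) pairs labeled by a reviewer.

        Every wrong prediction nudges the weights of that document's words
        toward the reviewer's label. Returns the number of wrong predictions.
        """
        errors = 0
        for text, is_responsive in reviewed:
            if self.predict(text) != is_responsive:
                errors += 1
                step = 1.0 if is_responsive else -1.0
                for word in tokenize(text):
                    self.weights[word] = self.weights.get(word, 0.0) + step
        return errors


def train(trainer, reviewed, max_cycles=50):
    """Repeat training cycles until one has no errors or the cap is reached."""
    for cycle in range(1, max_cycles + 1):
        if trainer.training_cycle(reviewed) == 0:
            return cycle
    return max_cycles


# Hypothetical reviewer-labeled examples (True = responsive to the request).
reviewed_examples = [
    ("merger pricing agreement draft", True),
    ("confidential merger term sheet", True),
    ("lunch menu for friday", False),
    ("office party parking notice", False),
]

trainer = ReviewTrainer()
cycles_used = train(trainer, reviewed_examples)
```

The cap on cycles mirrors the point that training may take many rounds; real data is rarely this cleanly separable, so a production system would rely on a far more robust model and on measured accuracy rather than a zero-error cycle.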
The obvious next step in utilizing machine learning was to automate data categorization while raising accuracy over that of manual categorization by individual employees. This approaches the capability that information governance professionals have long been waiting for: taking the categorization of the huge amounts of data employees encounter every day out of their hands and automating it. The issue is that most employees don’t actually have time to categorize, correctly store, and apply retention/disposition policies to their data every day, which results in the huge stores of unstructured data clogging up enterprise storage systems. Clearly, machine learning-based categorization will produce consistency and much higher accuracy than manual categorization.
When do you foresee machine learning-based categorization becoming available?
BT: In January of 2015, Microsoft acquired Equivio, a provider of machine learning technologies for eDiscovery and information governance. Over the next couple of years, Microsoft embedded this machine learning technology into its Office 365 cloud platform under the E5 license, which offers predictive coding for discovery of Office 365 data.
This year, Microsoft incorporated this machine learning technology into its Azure Cloud platform to enable its Cognitive and Media Services capabilities. The exciting thing about this technology on the Azure platform is that vendors can now build Azure applications that utilize machine learning at a much lower cost.
How does this machine learning tie back to Defensible Disposition?
BT: The next logical step in using machine learning is to utilize auto-categorization to determine what data is valueless, a copy, or beyond its retention period, which sets the basis for Defensible Disposition. Again, Defensible Disposition is the process of disposing of data that is no longer needed for the running of the business and is subject to neither regulatory retention nor a current or anticipated legal hold.
Machine learning for defensible disposition can be used in two ways: to categorize and dispose of the huge stockpiles of existing data around the enterprise, and to perform ongoing categorization and retention/disposition of live data, ensuring that a buildup of unmanaged data never happens again.
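To make those criteria concrete, here is a minimal Python sketch of a disposition check that also records a reason for each decision, so the outcome can be shown later. Everything in it is an assumption made for illustration: the `Record` fields, the `Decision` type, the order in which the rules are applied, and the idea that flags such as `is_duplicate` and `has_business_value` would be supplied by an upstream categorization step.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    """Hypothetical metadata that a categorization step might attach."""
    record_id: str
    category: str
    created: date
    retention_years: Optional[int]  # None = no regulatory retention applies
    on_legal_hold: bool = False     # current or anticipated legal hold
    is_duplicate: bool = False
    has_business_value: bool = True


@dataclass
class Decision:
    record_id: str
    dispose: bool
    reason: str  # kept so the decision can be audited later


def retention_end(record):
    """Return the date regulatory retention ends, or None if none applies."""
    if record.retention_years is None:
        return None
    year = record.created.year + record.retention_years
    try:
        return record.created.replace(year=year)
    except ValueError:
        # Created on Feb 29 and the target year is not a leap year.
        return record.created.replace(year=year, day=28)


def decide(record, today):
    """Apply the hold and retention checks first, then the value checks."""
    if record.on_legal_hold:
        return Decision(record.record_id, False, "subject to legal hold")
    end = retention_end(record)
    if end is not None and today < end:
        return Decision(record.record_id, False, f"under retention until {end}")
    if record.is_duplicate:
        return Decision(record.record_id, True, "duplicate copy")
    if not record.has_business_value:
        return Decision(record.record_id, True, "no business value")
    return Decision(record.record_id, False, "still needed by the business")
```

Checking legal hold and regulatory retention before anything else reflects the definition given above: a record is never eligible while either applies, regardless of how it was categorized.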
Earlier, I mentioned that predictive coding for eDiscovery used a “supervised” machine learning model – meaning it relied on human interaction to train it. With the amount of information already sitting in enterprises as well as the sheer volume of live data entering and leaving the enterprise, a supervised machine learning model would not be feasible.
So, what you are describing is an environment in which the computer trains itself?
BT: BINGO. For auto-categorization and defensible disposition to work, a self-learning or “unsupervised” machine learning model would need to be used.
In unsupervised machine learning, neither a labeled training data set nor training cycles are needed. Essentially the program trains itself based on the data set provided. Unsupervised machine learning opens the door to ongoing auto-categorization and defensible disposition of live data.
The only caveat is that all corporate data must be stored and available centrally so the program can manage it. This means that all employee computers need to be synced or, for laptops, data must be uploaded to a central location on a regular basis. But addressing the problem brings far more benefit than ignoring it. With predictive auto-categorization, the company addresses the problem of huge, unmanaged employee data, which typically makes up 80% of all data in an enterprise.
In the near future, unsupervised machine learning and auto-categorization will be the norm. Of course, there is still the question of how expensive it will be…
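As a rough illustration of grouping documents without any labeled examples, the Python sketch below clusters texts purely by how many words they share. It is a deliberately simple stand-in (greedy "leader" clustering with Jaccard similarity), not the technique used by Microsoft or any product mentioned here; the function names, the sample documents, and the 0.3 threshold are assumptions.

```python
def tokens(text):
    """Return the set of lowercase words in a document."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words two token sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(docs, threshold=0.3):
    """Group documents with no labels or training cycles.

    Each document joins the first cluster whose first member (its "leader")
    is similar enough; otherwise it starts a new cluster. Returns a list of
    clusters, each a list of indices into docs.
    """
    clusters = []  # list of (leader_tokens, member_indices)
    for index, doc in enumerate(docs):
        doc_tokens = tokens(doc)
        for leader_tokens, members in clusters:
            if jaccard(doc_tokens, leader_tokens) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((doc_tokens, [index]))
    return [members for _, members in clusters]


# Hypothetical unlabeled documents.
sample_docs = [
    "invoice payment due vendor acme",
    "vendor invoice payment overdue acme",
    "team lunch menu friday",
    "friday team lunch pizza menu",
]
groups = cluster(sample_docs)
```

The resulting groups carry no names; a person or a further rule would still have to decide which cluster maps to which retention category, which is where policy enters an otherwise automatic process.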
As with most technologies, early entrants will likely charge a premium. What can business and IT professionals do to manage costs until the competitor pool increases?
BT: To make machine learning capabilities available to all at a low price, cloud platforms like Microsoft Azure will need to offer machine learning capabilities as an included service; in fact, Microsoft has already begun to do this.
Then, organizations will need to seek technology solutions that provide the specific capabilities they require to fully achieve their data management, storage, protection, security, archiving, disposition and other initiatives. For instance, Archive360’s Archive2Azure is the first cloud-managed storage and archive solution for compliance and long-term data management built on Azure Cloud Services. Archive2Azure creates a highly secure, low-cost, legally compliant enterprise storage repository and archive suited to the storage and management of records, unstructured data, and legal data sets. Because it is built on the Azure Cloud, where Microsoft’s machine learning technology is already available to all Azure application developers, auto-categorization and defensible disposition are just around the corner.