This article was originally published in Intelligence Community News on February 3, 2020.
Enterprise-scale data management and engineering best practices in support of advanced analytics
The Intelligence Community (IC) has placed the highest priority on the continuous delivery of valuable analytic insights, whether derived from traditional analytic methodologies, AI and machine learning (ML), or human analysis. To support that mission, IC agencies need to operationalize the data stream, make data discoverable with trusted lineage, and treat data itself as an asset much like analysis. This is the process called “DataOps.”
“Our customers have told us they want to ‘reduce the cost of curiosity,’” said Justin Nall, vice president of Intelligence at NT Concepts. “Operationalizing data as an asset enables this for analysts and technologists; unlocking data that already exists in their organization to solve mission problems.”
Currently, data are trapped in silos across the enterprise. They are also unlabeled, stored in disparate formats, and not transformed into formats conducive to ingest into advanced analytic pipelines. DataOps is a process that removes the barriers analysts and technologists face when using data to build advanced analytic solutions.
“Simple can be harder than complex. You have to work hard to get your thinking clean to make it simple. But it’s worth it in the end because once you get there, you can move mountains.” – Steve Jobs
For advanced analytics, software development, or AI/ML solutions, the most foundational elements are data. Many analytical missions are slowed by data discovery, that is, just figuring out what you have.
To be useful, data need to be three things:
- Discoverable. Data need to be tagged (e.g., with common or similar metadata tags) and stored in common formats (e.g., Excel files, JSON, or delimited files).
- Accessible. Data need to be easily callable and pullable from their respective storage locations without heavy access controls (to the extent possible), egress charges, or bandwidth limitations.
- Intelligible. Data need to be internally formatted in accordance with an intelligible schema such that when a data file is opened, its contents are comprehensible.
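As a rough illustration of the discoverable and intelligible criteria above, the sketch below checks a catalog entry for required metadata tags and checks a record against a simple schema. The tag names, the schema, and both function names are hypothetical, chosen only for this example; a real organization would substitute its own tagging standard and schema definitions.

```python
# Hypothetical tagging standard: every dataset must carry these metadata tags.
REQUIRED_TAGS = ("source", "topic", "format", "updated_on")

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"entity_id": int, "name": str, "latitude": float}


def missing_tags(catalog_entry):
    """Discoverability: list the required tags that are absent or empty."""
    tags = catalog_entry.get("tags", {})
    return [tag for tag in REQUIRED_TAGS if not tags.get(tag)]


def schema_violations(record, schema):
    """Intelligibility: list the ways a record departs from the schema."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return problems
```

A dataset that returns an empty list from both checks is tagged and structured as expected; it says nothing about accessibility, which depends on storage and access policy rather than on the file itself.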
Even when these simple data governance standards are met, the data themselves can still be flawed in several ways. Custodians of enterprise-scale datasets or newly generated datasets should ensure that their data are:
- Comprehensive. Data are considered comprehensive if there are no missing variables or attributes. For example, if data are stored in a columnar format and entities are listed in rows, the columns represent their attributes. A comprehensively descriptive dataset is likely to have more than one or two attributes.
- Complete. Data are considered complete if there are no missing values. Perhaps a comprehensive dataset consists of several thousand rows and ten or so attribute columns, but numerous cells in the dataset are empty or labeled NULL. This is an example of an incomplete dataset.
- Accurate. Data are considered accurate if the contents of each cell, continuing the example above, are free of approximations, transcription errors, rounding errors, or formula errors.
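The first two checks in the list above can be roughly automated. The following is a minimal sketch, assuming rows are represented as Python dictionaries; the expected column names, the markers treated as missing, and the function names are all hypothetical. Accuracy is deliberately left out, since verifying it generally requires comparison against a trusted reference source.

```python
# Hypothetical attributes a comprehensive dataset is expected to carry.
EXPECTED_COLUMNS = {"entity_id", "name", "country", "observed_on"}

# Values treated as "missing" (an assumption; adjust per dataset).
MISSING_MARKERS = (None, "", "NULL")


def missing_attributes(rows, expected):
    """Comprehensiveness: expected attributes that appear in no row at all."""
    present = set()
    for row in rows:
        present.update(row.keys())
    return expected - present


def completeness(rows, columns):
    """Completeness: fraction of cells in the given columns holding a value."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0  # an empty table has no missing cells
    filled = sum(
        1
        for row in rows
        for column in columns
        if row.get(column) not in MISSING_MARKERS
    )
    return filled / total
```

A completeness score below 1.0 flags the situation described above, where the attribute columns exist but some cells are empty or labeled NULL.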
Simplicity and the handling of data as an asset are more a requirement of organizational culture than of tools and technology. Ultimately, the framework for data governance and dataset maintenance referenced above is what will drive data simplicity within an organization. It will make data easier both to access and to use.
Build towards generalizability
A common challenge with current AI/ML practice is the development of one-off, purpose-built algorithms that are: 1) not deployable within their intended environment, 2) trained on data that are fundamentally different from the data on which they are expected to make predictions, and 3) built to solve a very specific problem and quickly out-of-date once that problem evolves.
The lesson that AI/ML users now have to learn is an age-old principle of the scientific process: generalizability.
The most powerful solutions are those that are extensible to more than one use case and applicable in more than one situation. This ties in nicely to the idea of simplicity, because the most elegant solutions will be those that do not overfit (i.e., are not over-engineered to the specific problem they are trying to solve). These foundational solutions become building blocks that other technologists can fork and mature further. It is an art for technologists to identify the foundational elements of their solution and package that code alongside the more mature solution built with a specific problem in mind.
How DataOps works with MLOps and DevSecOps
An end-to-end solution requires a combined understanding and seamless integration of the data, technology, and science. Integrating the operationalization of data, development, and machine learning makes continuous and predictable production-level delivery achievable. Doing so means harnessing organizational data at the speed of mission, provisioning cross-functional teams, and integrating agile solution engineering pipelines that leverage secure and modular multi-cloud components.
Compatibility between these core elements is required to produce AI/ML that approximates human reasoning and decision-making. In practice, we hire data scientists to create algorithms, software developers to write code, security engineers to protect our systems, and data engineers to serve as the custodians for data. The primary responsibilities of engineers working in the DataOps track are:
- To transform data into formats conducive to ingest into advanced analytic pipelines
- To load data into various development environments
- To ensure data remain discoverable through consistent tagging and categorizing
- To split data into training and validation sets, ensuring representativeness in both cases
- To create, update, and/or maintain schemas and ontologies
- To identify additional data and architect the collection mechanisms to keep advanced analytic solutions fresh
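One of the responsibilities above, splitting data into training and validation sets while keeping both representative, is commonly handled with a stratified split. The sketch below is one simple way to do it using only the Python standard library; the function name, its parameters, and the default 20 percent validation share are assumptions made for illustration, not a prescribed method.

```python
import random
from collections import defaultdict


def stratified_split(records, label_of, validation_fraction=0.2, seed=0):
    """Split records so each label keeps roughly the same share in both sets.

    label_of is a caller-supplied function that returns a record's class label.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    # Group records by label so each class can be split independently.
    by_label = defaultdict(list)
    for record in records:
        by_label[label_of(record)].append(record)

    train, validation = [], []
    for label in sorted(by_label, key=str):  # stable order across runs
        group = by_label[label]
        rng.shuffle(group)
        n_validation = round(len(group) * validation_fraction)
        validation.extend(group[:n_validation])
        train.extend(group[n_validation:])
    return train, validation
```

For example, with 80 records of one class and 20 of another, the validation set receives 16 and 4 records respectively, preserving the original 80/20 class balance. Very small classes may round down to zero validation records, which is worth checking before training.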
In short, a good data engineer is the data advocate from project discovery through data discovery and scientific modeling to deployment. The first step to truly treating data as a commodity within an organization is to put a stakeholder at the table who represents it. The practitioner community has learned some hard lessons over the past few years regarding the difficulty of training, testing, and deploying advanced analytic solutions without high-fidelity or appropriately representative datasets.
DataOps is the simple solution.