The large volume of available data has spurred a surge in Machine Learning (ML) projects. In the first blog post in the Hidden Challenges of Machine Learning Lifecycle series, we looked at the complexity of clarifying the question to be solved with ML and the process of designing the project, including models and ontology. In this second installment, we look closely at data itself as a resource. To set the context, in the ML space, the term “data” refers to features and any corresponding labels. In an image dataset, for example, a feature is an object in the image (say, a fire hydrant), and the label is the bounding box and description of that object.
The type and sufficiency of the available data drive the type of model we can build and train, so a large quantity of high-quality data is necessary for a model to produce worthwhile results.
Preparing and labeling data for use is arguably one of the most challenging aspects of the ML Lifecycle because, while data may be broadly accessible, it does not necessarily exist in a format that can be fed into an ML model. Specifically, the data may lack proper labels or have restricted accessibility or usage rights.
Not surprisingly, creativity is a vital asset in this data preparation process. The data science team must be able to develop new ways to utilize pre-existing datasets or pre-existing models. In addition to solving the access and use issues, re-using datasets and models can create tremendous efficiencies as the project moves through the three-step process of:
- Data Cleaning and Wrangling
- Feature Engineering
- Data Labeling
In each step of this phase, hidden challenges exist that demand specific tactics. These tactics help the team manage large amounts of data, build accuracy into the QA/QC process, create labels, and establish auto-consensus methods.
Step One: Data Wrangling
Put simply, data wrangling is the process of collecting, massaging, and manipulating large amounts of data into a format that ML models can ingest. The data wrangling effort includes everything from parsing and consolidating data, to clipping and resizing images, to synthetically labeling or augmenting the data.
Synthetic Labeling / Data Augmentation is any process that artificially creates features or labels to augment training datasets when real data is limited or unavailable. This includes flipping, rotating, or cropping images; using existing data to generate photo-realistic images; and combining data sources in a meaningful way. The process supplements or replaces manually labeled data.
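One detail that is easy to miss with geometric augmentation is that the label has to move with the pixels. The sketch below is a minimal illustration and not a production pipeline. It assumes a grayscale image stored as a list of rows and a bounding box written as (x_min, y_min, x_max, y_max) with exclusive maximum edges; both conventions, and the function names, are ours for illustration. In practice a library such as NumPy or Pillow would do the pixel work.

```python
from typing import List, Tuple

Image = List[List[int]]            # rows of grayscale pixel values (assumed layout)
Box = Tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max), max edges exclusive


def flip_horizontal(image: Image, box: Box) -> Tuple[Image, Box]:
    """Mirror the image left-to-right and mirror the bounding box with it."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    x_min, y_min, x_max, y_max = box
    return flipped, (width - x_max, y_min, width - x_min, y_max)


def rotate_90_clockwise(image: Image, box: Box) -> Tuple[Image, Box]:
    """Rotate the image a quarter turn clockwise and rotate the box with it."""
    height = len(image)
    # Reversing the rows and then transposing is a clockwise quarter turn.
    rotated = [list(column) for column in zip(*image[::-1])]
    x_min, y_min, x_max, y_max = box
    # A pixel at (x, y) lands at (height - 1 - y, x) after the rotation.
    return rotated, (height - y_max, x_min, height - y_min, x_max)
```

Each transformed image and box pair can be added to the training set as a new labeled example, which is what makes the augmentation a form of synthetic labeling.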
As a general rule of thumb, datasets used for training should have no fewer than 100 examples per desired output class. This means a model with 10 potential classes needs a minimum of 1,000 examples in total. While it would be nice to use all of these examples to train, the data must be split into a training set (80%), a testing set (10%), and, optionally, an evaluation set (10%). This results in a lot of data, all of which must be stored somewhere. Cloud storage is the recommended data storage option for ML because it keeps individual computers free of space-hogging datasets and allows for multiple analysts to work with the data.
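To make the arithmetic concrete, here is a small sketch that checks the 100-examples-per-class rule of thumb and performs an 80/10/10 split. It splits within each class (a stratified split) so that every class keeps the same proportions in all three sets; that choice, the function names, and the fixed random seed are our assumptions rather than something the rule itself prescribes.

```python
import random
from collections import Counter, defaultdict

MIN_PER_CLASS = 100  # rule-of-thumb minimum from the text


def underpopulated_classes(labels, minimum=MIN_PER_CLASS):
    """Return {class: count} for every class with fewer than `minimum` examples."""
    counts = Counter(labels)
    return {cls: n for cls, n in counts.items() if n < minimum}


def stratified_split(examples, labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split (example, label) pairs into train/test/evaluation sets, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example, label in zip(examples, labels):
        by_class[label].append((example, label))

    train, test, evaluation = [], [], []
    for items in by_class.values():
        rng.shuffle(items)
        n_train = round(len(items) * fractions[0])
        n_test = round(len(items) * fractions[1])
        train += items[:n_train]
        test += items[n_train:n_train + n_test]
        evaluation += items[n_train + n_test:]  # whatever remains
    return train, test, evaluation
```

With 10 classes of 100 examples each, this yields 800 training, 100 testing, and 100 evaluation examples, which leaves only 80 training examples per class. That is one reason the 100-per-class figure is best read as a floor.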
Because data is rarely found in a uniform format within a single location, different sources of data must be combined and made uniform before any adjustments can take place. No two datasets are the same and ML models tend to be quite picky about the input format. Even pre-labeled datasets often need to be reformatted. Consequently, data wrangling is a continuous process that occurs before and after labeling to allow for adjustments as they are necessary.
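As an illustration of what "made uniform" can mean for labeled imagery, consider two pre-labeled sources that describe the same bounding box differently: one as [x, y, width, height] and the other as separate corner coordinates. The sketch below maps both into one shared schema. The source names, field names, and target schema are all hypothetical and chosen only for this example.

```python
def xywh_to_corners(box):
    """Convert [x, y, width, height] to (x_min, y_min, x_max, y_max)."""
    x, y, width, height = box
    return (x, y, x + width, y + height)


def normalize_record(record, source):
    """Map one labeled record from a known source into a shared schema.

    Both source layouts below are hypothetical examples.
    """
    if source == "source_a":  # {"file", "category", "bbox": [x, y, w, h]}
        return {
            "image": record["file"],
            "label": record["category"].strip().lower(),
            "box": xywh_to_corners(record["bbox"]),
        }
    if source == "source_b":  # {"path", "name", "xmin", "ymin", "xmax", "ymax"}
        return {
            "image": record["path"],
            "label": record["name"].strip().lower(),
            "box": (record["xmin"], record["ymin"], record["xmax"], record["ymax"]),
        }
    raise ValueError(f"unknown source: {source!r}")
```

Raising an error on an unknown source, rather than guessing, keeps a silently misread format from leaking bad labels into the training set.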
Step Two: Feature Engineering
Once the target dataset is collected, confirmed, chunked, and stored, the team can identify and extract relevant intermediate variables. This effort takes multi-dimensional raw data and turns it into feature sets that align with the question or problem established, as well as the models selected to solve it.
Because each problem is incredibly specific, automation is a relatively poor solution. Feature engineering demands significant domain knowledge. It is highly complex, incredibly difficult and time-consuming, and often expensive. The outcomes of this phase are fundamental to making the system work well over the long term.
Step Three: Data Labeling
At this point in the ML Lifecycle, the question is identified, and the requisite data is ingested, categorized, and organized. The team now can begin to create labels. This is the most arduous step in the data preparation phase because it requires pinpoint accuracy and comprehensive QA/QC to minimize cost and maximize output and accuracy. The outcome is thousands of focused training labels for each target and confuser class.
When preparing data for input to a model, the classes we define often do not represent equal portions of the dataset. When a class represents too little of the dataset, the model may not learn its features as well as those of a class that makes up the majority of the dataset. To address this, smaller classes can be oversampled and larger classes undersampled. However, this method can result in overfitting to the oversampled class and reduced performance on the undersampled class. Another method is to synthetically produce extra instances of the smaller class by creating altered samples of existing data.
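A minimal sketch of random oversampling and undersampling follows. It assumes the dataset is a list of (example, label) pairs and that every class should end up with the same target count; the function name and the target parameter are ours. Oversampling here simply repeats existing examples, which is exactly why it can encourage overfitting.

```python
import random
from collections import defaultdict


def rebalance(samples, target_per_class, seed=0):
    """Return a list with exactly `target_per_class` pairs for every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example, label in samples:
        by_class[label].append((example, label))

    balanced = []
    for items in by_class.values():
        if len(items) >= target_per_class:
            # Undersample: keep a random subset, no duplicates.
            balanced += rng.sample(items, target_per_class)
        else:
            # Oversample: keep everything, then repeat random examples.
            extra = rng.choices(items, k=target_per_class - len(items))
            balanced += items + extra
    return balanced
```

Rebalancing should be applied to the training set only. Duplicating examples before the split would let copies of the same example land in both the training and testing sets.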
Data labeling is a cyclical effort with a multitude of steps, including continuous QA/QC to ensure that the labels are accurate and consensus techniques that build accuracy into the QA/QC process by calculating object label agreement among multiple labelers. Establishing auto-consensus methods is necessary for quality control. In this process, multiple labelers label the same data so that the output labels can be compared using a measure of consensus. Labels whose agreement does not meet a set threshold are re-evaluated. While this increases the labeling time, it decreases the time needed to check the quality of the labels. NT Concepts speeds this process and ensures its integrity by working with partners like Figure Eight to turn unstructured data into high-quality training data and Labelbox to get off the ground with the labeling process quickly.
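One common way to measure agreement between bounding boxes is intersection over union (IoU). The sketch below averages pairwise IoU across labelers and flags items that fall below a threshold. It is an assumption-laden illustration: the choice of IoU, the 0.7 threshold, and the function names are ours, and labeling platforms may compute consensus differently.

```python
from itertools import combinations


def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0


def consensus(boxes):
    """Mean pairwise IoU; 0.0 when fewer than two labelers drew a box."""
    pairs = list(combinations(boxes, 2))
    if not pairs:
        return 0.0
    return sum(iou(a, b) for a, b in pairs) / len(pairs)


def needs_review(boxes_by_item, threshold=0.7):
    """Return the items whose labeler agreement is below the threshold."""
    return [item for item, boxes in boxes_by_item.items()
            if consensus(boxes) < threshold]
```

This only compares box geometry. Agreement on the class description would need its own check, for example a majority vote across labelers.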
This repetitive task requires pinpoint accuracy to process hundreds of images, PDFs, and text documents. While labeling software and distributed labeling technologies exist, they can be extremely costly and challenging to coordinate. Ideally, a specialized in-house workforce performs the tasks, supported by software that can help to drive speed and accuracy.
Data Preparation Project Use Case
NT Concepts completed a non-funded R&D project centered around expatriate community detection in ground-level imagery. The goal was to identify objects on the city streets that linked the area to a specific group of people. We chose text (the language is key) and flags (the country is key).
We did not want to pay labelers to label the imagery for text and flag detections. Instead, we opted to use a model pre-trained to find “text in the wild” (words on a storefront, for example), which looks very different from text on a piece of paper. Because of this difference, we had to choose the pre-labeled dataset carefully. The model could locate text in an image but could not translate it or detect its language. For these tasks, we used another pre-trained model. Crops of the text detections were fed into the second model, which handled translation and language detection.
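The shape of such a two-stage pipeline can be sketched as follows. This is not the code used in the project. The two models are stand-ins passed in as plain functions (`detect_text` and `identify_language` are hypothetical names), and the image is assumed to be a list of pixel rows.

```python
def crop(image, box):
    """Cut the (x_min, y_min, x_max, y_max) region out of a list-of-rows image."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]


def detect_and_identify(image, detect_text, identify_language):
    """Run a text detector, then pass each detected crop to a language model.

    detect_text(image) is assumed to return a list of boxes;
    identify_language(crop) is assumed to return a language name.
    """
    results = []
    for box in detect_text(image):
        language = identify_language(crop(image, box))
        results.append({"box": box, "language": language})
    return results
```

Keeping the two models behind simple function interfaces makes it easy to swap either one for a different pre-trained model without touching the rest of the pipeline.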
While there is a distinction between “text in the wild” and text on a piece of paper, pre-labeled datasets exist for both types. We were not this lucky when it came to flags.
Models pre-trained on flags do exist, but the NT Concepts team did not find one that was trained on specific country flags. We labeled the flag dataset using an object detection model pre-trained to find flags and a dataset of country flags broken into different folders by country. The model found the flags, and the folder name was the country name. These factors combined to create synthetic labels for this dataset.
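The idea of combining a generic detector with folder names can be sketched like this. It is an illustration under assumptions, not the project's code: the directory layout (one folder per country), the list of image suffixes, and the `detect_flags` callable standing in for the pre-trained model are all hypothetical.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of accepted file types


def synthetic_flag_labels(root, detect_flags):
    """Pair every detected flag box with the name of the folder it came from.

    `root` is assumed to contain one sub-folder per country;
    detect_flags(path) is assumed to return a list of bounding boxes.
    """
    labels = []
    for country_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for image_path in sorted(country_dir.iterdir()):
            if image_path.suffix.lower() not in IMAGE_SUFFIXES:
                continue  # skip anything that is not an image
            for box in detect_flags(image_path):
                labels.append({
                    "image": str(image_path),
                    "box": box,                  # where: from the detector
                    "label": country_dir.name,   # what: from the folder name
                })
    return labels
```

The detector supplies the "where" and the folder supplies the "what", so any image filed in the wrong folder, or containing a second country's flag, yields a wrong label that only a QC pass will catch.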
Manual QC and some manual labeling were necessary to correct mislabeled images, but neither was as time-consuming as labeling the full dataset by hand. The images in the flag dataset had different dimensions and file formats. Because the model required them to be the same size and file type, we had to reformat each image to meet the model’s requirements.
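To show what bringing images to a common size involves, here is a dependency-free nearest-neighbor resize on a list-of-rows image. It is a teaching sketch only. The function name is ours, and a real pipeline would normally rely on an imaging library to handle both resizing and file-format conversion.

```python
def resize_nearest(image, new_width, new_height):
    """Resize a list-of-rows image by copying the nearest source pixel."""
    old_height, old_width = len(image), len(image[0])
    return [
        [
            image[y * old_height // new_height][x * old_width // new_width]
            for x in range(new_width)
        ]
        for y in range(new_height)
    ]
```

Note that resizing to a fixed width and height changes the aspect ratio of images that do not already match it, and any bounding boxes have to be scaled by the same factors.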
The process did not yield enough examples of each country flag. To overcome this issue, we used synthetic augmentation to flip the images and add white noise to them, which tripled the amount of data available to train and test. We used this final dataset to locate flags in the ground-level imagery and identify which country they represent.
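One way such an augmentation can triple a dataset is to keep each original image and add one flipped copy and one noisy copy. The sketch below does exactly that. It is our reading of the description rather than the project's actual code, and the 8-bit grayscale list-of-rows image layout, the Gaussian noise with `sigma=10.0`, and the function names are assumptions.

```python
import random


def add_white_noise(image, sigma=10.0, rng=None):
    """Add zero-mean Gaussian noise to every pixel, clamped to the 0-255 range."""
    rng = rng or random.Random()
    return [
        [min(255, max(0, round(pixel + rng.gauss(0, sigma)))) for pixel in row]
        for row in image
    ]


def augment(dataset, seed=0):
    """Return the original, a horizontally flipped copy, and a noisy copy of each image."""
    rng = random.Random(seed)
    augmented = []
    for image, label in dataset:
        augmented.append((image, label))                         # original
        augmented.append(([row[::-1] for row in image], label))  # flipped
        augmented.append((add_white_noise(image, rng=rng), label))
    return augmented
```

The label is carried over unchanged to each new copy, which is what makes the extra examples free of additional labeling effort.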
Once the Data Labeling phase is concluded, the team can move into the third phase in the ML Lifecycle, during which the data is fit to models and cloud architecture is established. We will delve into the efforts and hidden challenges inherent in the considerations of cost, scale, technical limitations, and implementation demands in the next post.