Intelligent Data Labeling and the Future of AI

Sep 1, 2021

By its nature, deep learning (the most useful type of machine learning when it comes to complex tasks) consumes massive volumes of data.  This includes target data, data that it uses to refine algorithms and inferences, and initial training data.  Of these, the initial training data is in many ways the most valuable, and all too often the most difficult to acquire, refine, and apply.

Why is this?

Machine learning systems use training data to initially map out the domain in question (for example, "This is a verb, this is a giraffe, these are buildings, these are points that cluster closely together, these are points that never occur together, etc.") in order to develop sophisticated models of relationships and make inferences based on those models.  

High-quality training data makes it easier for the system to refine and extend its understanding of the domain as it processes an increasing quantity of data later on.  Data of lower quality may extend the time required to develop an in-depth understanding of the domain, and may in fact introduce difficult-to-detect errors or biases that can cause serious problems later on.

Quality - At a Price

An ideal training data set, then, might consist of a large and well-curated set of data points completely and accurately labeled by human subject-matter experts.  Unfortunately, the per-data-point cost for such a training set is likely to be high, particularly in domains where experts are relatively scarce and in demand – and in practice, those experts may have very limited time to devote to the task, even if the money is available to pay them.  This is true, for example, in high-demand fields such as security- and defense-related satellite image analysis, medical diagnosis involving potentially life-threatening conditions, and other high-priority, mission-critical tasks.  The more dependent a domain is on expert knowledge, the more likely it is that limited access to subject-matter experts will become a bottleneck.

Getting Past the Bottleneck

But it is often just such domains where AI systems that use machine learning are likely to be of greatest value, since they can provide greater access to subject-matter expertise while allowing experts to make greater use of their own knowledge.  As is so often the case, there is more than enough payoff to justify the considerable effort that it takes to get past the bottleneck.  So what's the best way to do this?

Using AI to Build AI

That's where intelligent data labeling comes in.  The idea is simple – you use AI technology to label data with only limited access to human experts, make effective use of non-expert labeling, or eliminate the need for human-mediated labeling entirely.  This makes it possible to bootstrap a seemingly inadequately-labeled data set or a stream of raw data into a workable training set for a machine learning system – which in turn makes it possible to get expert systems off the ground and working in domains where they are in demand but have not previously been available.

What are the most common techniques for intelligent labeling?  Here's a quick survey:

Active Learning: Using Experts Efficiently

Active learning does require the ongoing participation of an expert (whether human or machine), but only on a limited basis.  

In active learning, the system starts with a mix of labeled and unlabeled data.  It first uses the labeled data to make inferences about the unlabeled data.  It then selects a subset of the unlabeled data and asks the expert to label the items in that subset.  The system checks its initial inferences against the labels that the expert provides, and refines its inferences based on that information.  The process can proceed iteratively, until the system reaches a target level of accuracy.

There are several criteria that an active learning system can use for selecting the items to be queried.  It could, for example, pick the items that it is least certain about, or it could choose the items that would do the most to reduce error if labeled correctly.  Other strategies compare multiple models and query the items on which they disagree, apply statistical selection, or request labels for the items that are most representative of the unlabeled data or most likely to change the model.  The overall goal of all of these approaches, however, is to maximize the accuracy of the models being developed while making the most efficient use of the available expertise.
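
To make the loop concrete, here is a minimal sketch of the "least certain" strategy.  It rests on several assumptions that are not part of any particular product: the data is one-dimensional, the model is a simple nearest-centroid classifier, the initial labeled set contains at least two classes, and the names active_learn, margin, and oracle are illustrative only.  The oracle function stands in for the expert.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, int]  # (feature value, class label)


def centroids(labeled: List[Point]) -> Dict[int, float]:
    """Mean feature value of each class in the labeled set."""
    groups: Dict[int, List[float]] = {}
    for x, y in labeled:
        groups.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in groups.items()}


def margin(x: float, cents: Dict[int, float]) -> float:
    """Gap between the two nearest centroids; a small gap means low certainty."""
    dists = sorted(abs(x - c) for c in cents.values())
    return dists[1] - dists[0]


def active_learn(
    labeled: List[Point],
    unlabeled: List[float],
    oracle: Callable[[float], int],
    budget: int,
) -> Tuple[List[Point], List[float]]:
    """Query the expert (oracle) for the least certain item, up to `budget` times."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(min(budget, len(pool))):
        cents = centroids(labeled)  # refit on everything labeled so far
        query = min(pool, key=lambda x: margin(x, cents))  # least certain item
        pool.remove(query)
        labeled.append((query, oracle(query)))  # ask the expert for a label
    return labeled, pool


def predict(x: float, labeled: List[Point]) -> int:
    """Assign x to the class with the nearest centroid."""
    cents = centroids(labeled)
    return min(cents, key=lambda y: abs(x - cents[y]))
```

With two seed labels and a budget of two queries, a sketch like this spends the expert's time on the points nearest the class boundary and leaves the easy points to the model.  A real system would swap in its own model and uncertainty measure.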

Weak Supervision: Making Inadequate Labels Work

If high-quality expert labeling is hard to come by, what can you do with low-quality, less-than-expert (but cheaper and easier to get) labels?  As it turns out, quite a lot.

Low-quality labeling comes in three basic varieties (which are often found together).  It may be incomplete, inaccurate, or inexact.  The challenge in all three cases is to produce high-quality training results while using such unpromising training data.  The strategies for doing this depend to some degree on the specific labeling problems:  

Incomplete - Only a Few Items Labeled

If you have access to expert input on a part-time basis, then active learning may be the most effective way to deal with incomplete data.  In the absence of expert input, however, you can use statistics-intensive mathematical analysis to fit unlabeled data into patterns based on labeled data, identify boundaries between data item clusters, or compare the effectiveness of multiple models based on subsets of the available data.  In most cases, statistical analysis is best at placing unlabeled data items in relation to labeled items or to each other.
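
As a rough illustration of placing unlabeled items in relation to labeled ones and to each other, the sketch below spreads labels outward from a few labeled seeds.  It is only one simple, greedy form of label propagation, it assumes one-dimensional data with absolute difference as the distance, and the function name propagate_labels is hypothetical.

```python
from typing import List, Tuple


def propagate_labels(
    labeled: List[Tuple[float, str]], unlabeled: List[float]
) -> List[Tuple[float, str]]:
    """Greedily label the unlabeled point closest to any already-labeled point.

    Newly labeled points join the labeled set, so labels spread along
    chains of nearby points rather than only from the original seeds.
    """
    if not labeled:
        raise ValueError("at least one labeled seed point is required")
    known = list(labeled)
    pool = list(unlabeled)
    while pool:
        # Find the (unlabeled, labeled) pair with the smallest distance.
        _, idx, label = min(
            ((abs(u - x), i, y) for i, u in enumerate(pool) for x, y in known),
            key=lambda t: t[0],
        )
        known.append((pool.pop(idx), label))
    return known


seeds = [(0.0, "a"), (10.0, "b")]
result = dict(propagate_labels(seeds, [2.0, 1.0, 8.0, 3.0, 9.0]))
# Points near 0.0 inherit "a"; points near 10.0 inherit "b".
```

Because each newly labeled point can pass its label on, an approach like this tends to follow the shape of the clusters in the data, which is also why a single mislabeled seed can spread.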

Unreliable Labeling

Crowdsourcing may be a good method for obtaining a large volume of labeled data at a minimal cost within a relatively short time, but the data often has little or no quality control.  Data items may be labeled by people (or bots) who have inadequate or no understanding of the domain, do not care about accuracy, are not paying sufficient attention to the task, or who deliberately mislabel items.  You may encounter even greater accuracy problems with data that is automatically scraped from online sources, such as tagged social media images.  

The challenge is to eliminate as much of this built-in inaccuracy as possible.  If you can track individual labelers and the data that they provided, you may be able to use statistical methods to identify most high-volume malicious actors and discard their work.  With crowdsourced data, if each data item has been labeled by a sufficiently large number of different individuals, you can often select the most appropriate label by majority vote.  In other cases, graph-based methods, which treat data points as nodes connected to similar points, may allow the system to detect and eliminate data points that do not fit in with their surroundings.  
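
The first two ideas (tracking labelers and voting) can be sketched together in a few lines.  This is a minimal example under stated assumptions: votes arrive as (item, labeler, label) tuples, a labeler's reliability is estimated simply as how often they agree with the initial majority, and the 0.5 cutoff and the function names are illustrative rather than recommended values.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Vote = Tuple[str, str, str]  # (item id, labeler id, label)


def majority_vote(votes: List[Vote]) -> Dict[str, str]:
    """Pick the most frequent label for each item (ties go to the first seen)."""
    by_item = defaultdict(list)
    for item, _, label in votes:
        by_item[item].append(label)
    return {item: Counter(labels).most_common(1)[0][0] for item, labels in by_item.items()}


def labeler_agreement(votes: List[Vote], consensus: Dict[str, str]) -> Dict[str, float]:
    """Fraction of each labeler's votes that match the consensus label."""
    hits, totals = Counter(), Counter()
    for item, labeler, label in votes:
        totals[labeler] += 1
        hits[labeler] += int(label == consensus[item])
    return {labeler: hits[labeler] / totals[labeler] for labeler in totals}


def robust_labels(votes: List[Vote], min_agreement: float = 0.5) -> Dict[str, str]:
    """Vote, drop labelers who rarely agree with the majority, then vote again.

    Items whose only votes came from dropped labelers are left out of the result.
    """
    consensus = majority_vote(votes)
    agreement = labeler_agreement(votes, consensus)
    trusted = [v for v in votes if agreement[v[1]] >= min_agreement]
    return majority_vote(trusted)


votes = [
    ("img1", "ann", "cat"), ("img1", "bob", "cat"), ("img1", "bot", "dog"),
    ("img2", "ann", "dog"), ("img2", "bob", "dog"), ("img2", "bot", "cat"),
    ("img3", "ann", "cat"), ("img3", "bot", "dog"),
]
labels = robust_labels(votes)
```

In this toy data, the labeler "bot" disagrees with the majority on every item it shares with others, so its vote no longer creates a tie on img3 once it is filtered out.  This only works when honest labelers outnumber unreliable ones on most items.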

Low-Resolution Labels

Labels can also be inexact, providing information that is accurate but lacks sufficient granularity.  If, for example, you want to label all images that contain a specific object, and the existing labeling data does not say explicitly whether it is in a given image, you can treat each image as a collection (a "bag") of objects (instances), then use heuristic (approximate but sufficient) rules to predict which bags contain the object in question.  

This technique can be applied to items containing discrete data (such as medical records or descriptions of chemical systems), semi-discrete data (raw text), or continuous data (images or audio streams).  Depending on the type of data and the available information, it may also be able to locate the target instance within each bag.
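
A small sketch may help here.  It assumes that each image has already been split into candidate regions (the instances) and that a heuristic scoring rule is available.  The giraffe_score rule, its feature names, and the 0.75 threshold are all hypothetical, chosen only to show the shape of the approach: the collection is labeled positive if its best-scoring instance clears the threshold, and that instance's position is returned as well.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def label_bag(
    instances: Sequence[T],
    score: Callable[[T], float],
    threshold: float = 0.5,
) -> Tuple[bool, Optional[int]]:
    """Return (contains_target, index of the best-scoring instance or None)."""
    if not instances:
        return False, None
    scores = [score(inst) for inst in instances]
    best = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best] >= threshold:
        return True, best
    return False, None


def giraffe_score(region: dict) -> float:
    """Hypothetical heuristic: tall regions with a spotted texture look giraffe-like."""
    tall = 1.0 if region["height"] > 2 * region["width"] else 0.0
    spotted = 1.0 if region["texture"] == "spotted" else 0.0
    return (tall + spotted) / 2


image = [
    {"height": 50, "width": 10, "texture": "rough"},    # a tree: tall, not spotted
    {"height": 40, "width": 15, "texture": "spotted"},  # tall and spotted
]
found, where = label_bag(image, giraffe_score, threshold=0.75)
```

Taking the maximum over instances reflects the usual assumption that one matching instance is enough to make the whole collection positive.  Other aggregation rules are possible when that assumption does not hold.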

Other Strategies

There are a variety of other strategies for intelligent labeling.  

You can, for example, use a small amount of high-quality labeled data to generate a model that you then use to label a larger volume of training data.  
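
One common way to do this is often called self-training or pseudo-labeling.  The sketch below assumes one-dimensional data, a nearest-centroid model trained on the small labeled set, and at least two classes.  The confidence cutoff (min_margin) is an addition of this example rather than a requirement of the technique: it keeps only the machine-generated labels that the model is reasonably sure about and sets the rest aside.

```python
from typing import Dict, List, Tuple


def fit_centroids(labeled: List[Tuple[float, str]]) -> Dict[str, float]:
    """Train the small model: one mean feature value per class."""
    groups: Dict[str, List[float]] = {}
    for x, y in labeled:
        groups.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in groups.items()}


def pseudo_label(
    labeled: List[Tuple[float, str]],
    unlabeled: List[float],
    min_margin: float,
) -> Tuple[List[Tuple[float, str]], List[float]]:
    """Label the unlabeled points with the model, keeping only confident ones.

    Confidence is the gap between the distances to the two nearest centroids.
    Returns (accepted pseudo-labeled points, points left unlabeled).
    """
    cents = fit_centroids(labeled)
    accepted, rejected = [], []
    for x in unlabeled:
        ranked = sorted(cents, key=lambda y: abs(x - cents[y]))
        gap = abs(x - cents[ranked[1]]) - abs(x - cents[ranked[0]])
        if gap >= min_margin:
            accepted.append((x, ranked[0]))
        else:
            rejected.append(x)
    return accepted, rejected


gold = [(0.0, "low"), (2.0, "low"), (8.0, "high"), (10.0, "high")]
accepted, rejected = pseudo_label(gold, [0.5, 4.5, 5.5, 9.5], min_margin=2.0)
```

The rejected points are natural candidates to send to an expert, which is one way this approach combines with active learning.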

You may also be able to initially train the system on a large volume of data from a domain that is similar to the target domain, or that is within the domain but contains an unacceptable level of noise, and then refine the model using a smaller volume of low-noise, domain-specific data.  
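
The effect can be seen even in a toy setting.  The sketch below assumes a one-parameter model (y = w * x) trained by plain gradient descent, with a plentiful "similar domain" data set whose slope is close to, but not the same as, that of the small target data set.  It does not model label noise, and all names and numbers are illustrative.

```python
from typing import List, Tuple


def fit_slope(
    data: List[Tuple[float, float]],
    w: float = 0.0,
    steps: int = 100,
    lr: float = 0.05,
) -> float:
    """Gradient descent on mean squared error for the model y = w * x.

    Passing a starting value for w continues training from an earlier model.
    """
    for _ in range(steps):
        grad = sum(2 * x * (w * x - y) for x, y in data) / len(data)
        w -= lr * grad
    return w


source = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # similar domain: y = 2.0 * x
target = [(1.0, 2.5), (2.0, 5.0)]              # target domain:  y = 2.5 * x

pretrained = fit_slope(source, steps=50)               # train on the larger set
fine_tuned = fit_slope(target, w=pretrained, steps=3)  # brief refinement
from_scratch = fit_slope(target, steps=3)              # same budget, no head start
```

With the same small training budget on the target data, the model that starts from the pretrained value ends up closer to the target slope than the one that starts from zero.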

In many cases, the nature of the available training data will dictate a weak supervision approach, but you will have sufficient access to expert knowledge to make at least some use of active learning.  You can also combine these techniques with other approaches based on your specific requirements.  A platform such as Watchful, which uses a wide range of machine learning / deep learning techniques, makes it easy to develop a coordinated combination of methods that is closely suited to your data-labeling needs.

Put Your Best Use Cases First

And the bottom line?  Maybe it's this:

You do not need to let the availability of data sets that have good labels (or that can be easily labeled) dictate which domains and tasks get the benefit of your artificial intelligence firepower.  Instead, you can focus your ML/AI effort on the use cases that are the most valuable, using intelligent labeling techniques to get your system up to speed – even when your initial training data looks less than promising.  

You can control the training data, rather than letting it control you.

Shayan Mohanty

Shayan Mohanty is the CEO and Co-Founder of Watchful. Prior to Watchful, he spent a decade leading data engineering teams at various companies, including Facebook, where he served as lead for the stream processing team responsible for processing 100% of the ads metrics data for all FB products.