Automated Data Labeling for Text Classification

Nov 14, 2022

For machine learning (ML) practitioners, one of the most common and important tasks is full text classification (FTC), a technique in which you assign a set of categories or tags to an entire row of text. How well this apparently simple task is done has a direct impact on how we build and run apps for social media sites, news, blog posts, and online forums, to name a few of the best-known use cases.

According to recent estimates, humanity is generating multiple quintillions (that’s right, the number with eighteen 0’s!) of bytes every single day. Until recently, using distributed crowds of human labelers was the only approach to understanding, organizing, and filtering this kind of data at scale. But is there a more efficient way for ML practitioners to enhance user experience and make smart decisions when fast and accurate FTC is a must for the project?

Data Centric Approach for Text Classification

At the most basic level, AI = Code + Data, and much of ML practice has historically been built around model development: iterating on the architecture, training procedures, feature engineering, and so on. In this scenario, we treat the data as a fixed component and focus only on the model to improve performance. Here, initial data labeling and iterations on model accuracy are traditionally manual tasks.

The more recent data-centric approach focuses on systematically improving the quality of the datasets to improve the accuracy of the machine learning model output. This works when you deal with small datasets that rarely change. But what do you do when you need representative samples for huge amounts of long-tail data, or when the data becomes obsolete by the time the sample is perfected?

Challenges with Manual Approaches

Before describing a more efficient approach to FTC, it’s worth exploring what exactly makes the manual approach to data labeling a problem worth solving. Probably the greatest hurdle, and one most ML practitioners can instantly relate to, is finding and scaling your Subject Matter Experts (SMEs). It is often impossible to consistently outsource your manual labeling for the simple reason that knowledge transfer from your SMEs to the labelers is extremely difficult to handle. ML teams must coordinate the daunting task of building training materials, QA processes, procedures, etc. for labelers who are likely living on different continents.

Even when the data is not very specialized, you can safely assume there will be some inconsistency in dataset creation. It can come from multiple sources: labelers’ personal biases, staff churn, and delays arising from batch processing, to name a few. Also, large datasets on a fixed budget usually involve longer execution times, which cannot keep pace with inherent model drift, especially when data velocity is high.

The Programmatic Approach

Programmatic labeling is the process of writing programs that assign labels for parts of your dataset and applying them to your machine learning project. The process starts by selecting the parts of the dataset that are related - directly or indirectly - to the labels we want to produce and/or deduce. 
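To make the idea concrete, here is a minimal sketch of programmatic labeling in Python. It is an illustration under stated assumptions, not the method of any particular tool: the function names, the keywords, the "billing" and "account" labels, and the majority-vote rule for combining votes are all hypothetical choices made for this example.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


# Hypothetical labeling functions: each one encodes a single heuristic.
def lf_refund(text):
    lowered = text.lower()
    return "billing" if "refund" in lowered or "invoice" in lowered else ABSTAIN


def lf_password(text):
    return "account" if "password" in text.lower() else ABSTAIN


def lf_login(text):
    lowered = text.lower()
    return "account" if "log in" in lowered or "login" in lowered else ABSTAIN


LABELING_FUNCTIONS = [lf_refund, lf_password, lf_login]


def label_row(text, functions=LABELING_FUNCTIONS):
    """Combine the votes of all labeling functions by simple majority."""
    votes = [v for v in (fn(text) for fn in functions) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN  # no function fired, so the row stays unlabeled
    counts = Counter(votes).most_common()
    # Abstain on a tie rather than guess between conflicting heuristics.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


rows = [
    "I need a refund for my last invoice",
    "I forgot my password and cannot log in",
    "What are your opening hours?",
]
labels = [label_row(row) for row in rows]
```

Rows that no function covers stay unlabeled, which is usually a signal that another heuristic is needed rather than an error.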

Instead of relying on just the data scientists and software developers, or even outsourced labelers, it is much more efficient to leverage Subject Matter Experts (SMEs) to process the data. In order for them to rapidly deploy their own purpose-built AI, a new approach is needed for data discovery, tooling, automation and validation, according to Jaidev Amrite (SparkCognition).

Programmatic labeling can be a good fit for your use case if you are dealing with a large amount of data (tens of thousands of rows and above) that requires some level of expertise to label and that changes quickly enough to warrant a solution that doesn’t add delays to the labeling process. Of course, data scientists and ML engineers can write their own labeling functions from scratch, but this trial-and-error approach takes a lot of time and resources.

Developing a Guided Labeling Experience

Rather than gaining knowledge from data alone, we can have the SMEs teach the machine. They can decompose any problem into smaller parts and provide examples from which the algorithm learns the task on its own, enabling an explainable taxonomy that is a proxy for deep learning models.

Let’s see how this works. 

First, we upload a set of rows and develop predictive labeling functions to transform raw data into training data by exploring patterns in the data. The key is having an easy-to-use interface focused on making data labeling more efficient and pleasant. Instead of coding functions from scratch or writing regex, SMEs enter a few labels by hand, and the suggestion engine reverse engineers labeling functions that match the patterns in those hand labels.
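One simple way to picture this reverse engineering step is keyword induction: look for tokens that appear in several hand-labeled rows and always with the same label, then propose each as a candidate rule for the expert to accept or reject. The sketch below is an assumption made for illustration and is not a description of how any real suggestion engine works; `suggest_keyword_rules`, `min_support`, and `min_precision` are hypothetical names.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def suggest_keyword_rules(hand_labeled, min_support=2, min_precision=1.0):
    """Propose (token, label, hits, precision) rules from hand-labeled rows.

    A token becomes a candidate rule when it appears with its most common
    label in at least `min_support` rows and that label accounts for at
    least `min_precision` of the rows containing the token.
    """
    token_counts = Counter()
    token_label_counts = defaultdict(Counter)
    for text, label in hand_labeled:
        for token in tokenize(text):
            token_counts[token] += 1
            token_label_counts[token][label] += 1

    rules = []
    for token, total in token_counts.items():
        label, hits = token_label_counts[token].most_common(1)[0]
        precision = hits / total
        if hits >= min_support and precision >= min_precision:
            rules.append((token, label, hits, precision))
    # Strongest support first; break ties alphabetically for stable output.
    return sorted(rules, key=lambda rule: (-rule[2], rule[0]))


# A handful of hypothetical hand labels provided by an SME.
hand_labeled = [
    ("Refund my order please", "billing"),
    ("Where is my refund", "billing"),
    ("Reset my password", "account"),
    ("Password not working", "account"),
]
rules = suggest_keyword_rules(hand_labeled)
```

In this toy data the token "my" is rejected because it appears under both labels, while "refund" and "password" survive as candidate rules.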

Notice the difference: instead of having a person sit down and painstakingly create labeling functions for each of the individual entities, you can have a subject-matter expert sit down and click on “Yes” or “No”. Or they can enrich a few hundred individual rows with metadata, and the system can generate predictive labeling functions for the dataset that can be reused as needed, independent of the number of SMEs working on your project.

This not only enables you to reuse the work in an efficient way on the rest of the dataset, but also offers a recourse, in case you detect something wrong with your data or you want to provide documentation on a model decision pattern. With manual labeling this would mean weeks or months of delays and significant extra costs. 
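The recourse described here comes from the fact that every programmatic label can be traced back to the functions that produced it. The following sketch shows one way this could look, assuming labeling functions are stored under names; the functions, labels, and helper names (`explain`, `rows_touched_by`) are hypothetical.

```python
# Hypothetical labeling functions, registered by name so votes can be traced.
def lf_contains_refund(text):
    return "billing" if "refund" in text.lower() else None


def lf_contains_password(text):
    return "account" if "password" in text.lower() else None


FUNCTIONS = {
    "contains_refund": lf_contains_refund,
    "contains_password": lf_contains_password,
}


def explain(text, functions=FUNCTIONS):
    """Return {function name: label} for every function that voted on text."""
    votes = {}
    for name, fn in functions.items():
        label = fn(text)
        if label is not None:
            votes[name] = label
    return votes


def rows_touched_by(name, rows, functions=FUNCTIONS):
    """Return indices of rows a given function voted on, e.g. to re-check
    them after that function is found to be wrong."""
    return [i for i, text in enumerate(rows) if functions[name](text) is not None]


rows = ["Please refund me", "I lost my password", "Hello there"]
audit = [explain(row) for row in rows]
affected = rows_touched_by("contains_password", rows)
```

If a rule turns out to be faulty, only the rows it touched need to be relabeled, and the fix is a change to one function rather than a new round of manual work.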

Programmatic Labeling for Text Classification

The best way to see the value of the programmatic approach to data labeling for text classification is to look at real case studies. As mentioned above, the cost and time involved must be carefully weighed. My team had the opportunity to put this approach into practice for a number of suitable use cases, with measurable results.

The Wilson Sonsini data science team had a large body of unclassified data that potentially contained critical insights about the types of work the firm provides to its clients, but that was incredibly difficult to tackle with traditional data labeling approaches. They created a prediction algorithm that was eventually applied to the larger set of related data entries. Their SMEs spent two 3-hour sessions with the team to develop an automated pipeline and workflow for newly generated data entries, generating additional insight.

A Proper High is a company focused on normalizing noisy and fragmented cannabis e-commerce data. Using the Watchful interface, they were able to automate their classification and information extraction. It took one engineer only one afternoon to classify their entire library of more than 200,000 products with greater than 99% accuracy. They reduced their original 30-day estimate to less than 4 hours of effort.

Final Thoughts

There is no “silver bullet” solution to how we approach ML, only constant improvements with each iteration. What we do know today is that manually labeling large-scale training datasets for classifier use cases is an uphill battle against inefficiency, model drift, low quality, human biases, and a shortage of SMEs.

A programmatic approach to key tasks can be more efficient in both time and money, and it can unlock use cases that would otherwise be untenable. We must find powerful tools to rapidly explore and identify the best data to train models. The future is giving experts “superpowers” compared to the established way of doing things. One of them is to have an expert label just a few data samples and let a programmatic set of functions learn from those examples, producing training data automatically.

Shayan Mohanty

Shayan Mohanty is the CEO and Co-Founder of Watchful. Prior to Watchful, he spent a decade leading data engineering teams at various companies including Facebook, where he served as lead for the stream processing team responsible for processing 100% of the ads metrics data for all FB products.