Architecting Effective Data Labeling Systems for Machine Learning Pipelines

Machine learning models are trained on massive datasets in which each data point is labeled to give it context and meaning. This deep dive describes how to build a data labeling architecture from scratch, with a focus on workflow, security, and data quality.


The intelligence in artificial intelligence is rooted in vast amounts of data upon which machine learning (ML) models are trained—with recent large language models like GPT-4 and Gemini processing trillions of tiny units of data called tokens. This training dataset does not simply consist of raw information scraped from the internet. In order for the training data to be effective, it also needs to be labeled.

Data labeling is a process in which raw, unrefined information is annotated or tagged to add context and meaning. This improves the accuracy of model training, because you are in effect marking or pointing out what you want your system to recognize. Some data labeling examples include sentiment analysis in text, identifying objects in images, transcribing words in audio, or labeling actions in video sequences.

It’s no surprise that data labeling quality has a huge impact on training. Originally coined by William D. Mellin in 1957, “garbage in, garbage out” has become something of a mantra in machine learning circles. ML models trained on incorrect or inconsistent labels will have a difficult time adapting to unseen data and may exhibit biases in their predictions, causing inaccuracies in the output. Low-quality data can also compound, causing issues further downstream.

This comprehensive guide to data labeling systems will help your team boost data quality and gain a competitive edge no matter where you are in the annotation process. First I’ll focus on the platforms and tools that comprise a data labeling architecture, exploring the trade-offs of various technologies, and then I’ll move on to other key considerations including reducing bias, protecting privacy, and maximizing labeling accuracy.

Understanding Data Labeling in the ML Pipeline

The training of machine learning models generally falls into three categories: supervised, unsupervised, and reinforcement learning. Supervised learning relies on labeled training data, which presents input data points associated with correct output labels. The model learns a mapping from input features to output labels, enabling it to make predictions when presented with unseen input data. This is in contrast with unsupervised learning, where unlabeled data is analyzed in search of hidden patterns or data groupings. With reinforcement learning, the training follows a trial-and-error process, with humans involved mainly in the feedback stage.
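
To make the distinction concrete, here is a minimal Python sketch using scikit-learn and made-up toy data: a supervised classifier learns from labeled examples, while an unsupervised clustering model receives no labels at all.

```python
# Minimal sketch contrasting supervised and unsupervised learning.
# The toy arrays below are illustrative placeholders, not real training data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: every input row is paired with a correct output label.
X_labeled = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]])
y_labels = np.array([0, 1, 0, 1])
clf = LogisticRegression().fit(X_labeled, y_labels)
print(clf.predict([[0.85, 0.75]]))  # predicts a label for unseen input

# Unsupervised: no labels; the model searches for structure on its own.
X_unlabeled = X_labeled
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X_unlabeled)
print(clusters)  # cluster assignments discovered without labels
```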

Most modern machine learning models are trained via supervised learning. Because high-quality training data is so important, it must be considered at each step of the training pipeline, and data labeling plays a vital role in this process.

The ML model development pipeline: data collection, cleaning, and labeling; model training, fine-tuning, and deployment; and collecting new data for further tuning.

Before data can be labeled, it must first be collected and preprocessed. Raw data is collected from a wide variety of sources, including sensors, databases, log files, and application programming interfaces (APIs). It often has no standard structure or format and contains inconsistencies such as missing values, outliers, or duplicate records. During preprocessing, the data is cleaned, formatted, and transformed so it is consistent and compatible with the data labeling process. A variety of techniques may be used. For example, rows with missing values can be removed or updated via imputation, a method where values are estimated via statistical analysis, and outliers can be flagged for investigation.
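
As a rough illustration, the following pandas sketch shows these preprocessing steps on a hypothetical tabular dataset; the file path, column names, and thresholds are assumptions made for the example.

```python
# A hedged preprocessing sketch: deduplication, missing-value handling, and
# outlier flagging. "raw_data.csv", "age", and "income" are hypothetical.
import pandas as pd

df = pd.read_csv("raw_data.csv")
df = df.drop_duplicates()                  # remove duplicate records

# Either drop rows with missing values...
df_dropped = df.dropna()

# ...or impute them from simple statistics so the rows can be kept.
df["age"] = df["age"].fillna(df["age"].median())

# Flag outliers for investigation using a z-score rule of thumb (|z| > 3).
z_scores = (df["income"] - df["income"].mean()) / df["income"].std()
df["income_outlier"] = z_scores.abs() > 3
```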

Once the data is preprocessed, it is labeled or annotated in order to provide the ML model with the information it needs to learn. The specific approach depends on the type of data being processed; annotating images requires different techniques than annotating text. While automated labeling tools exist, the process benefits heavily from human intervention, especially when it comes to accuracy and avoiding any biases introduced by AI. After the data is labeled, the quality assurance (QA) stage ensures the accuracy, consistency, and completeness of the labels. QA teams often employ double-labeling, where multiple labelers annotate a subset of the data independently and compare their results, reviewing and resolving any differences.

Next, the model undergoes training, using the labeled data to learn the patterns and relationships between the inputs and the labels. The model’s parameters are adjusted in an iterative process to make its predictions more accurate with respect to the labels. To evaluate the effectiveness of the model, it is then tested with labeled data it has not seen before. Its predictions are quantified with metrics such as accuracy, precision, and recall. If a model is performing poorly, adjustments can be made before retraining, one of which is improving the training data to address noise, biases, or data labeling issues. Finally, the model can be deployed into production, where it can interact with real-world data. It is important to monitor the performance of the model in order to identify any issues that might require updates or retraining.
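
For instance, a handful of scikit-learn calls is enough to compute those evaluation metrics on held-out labeled data; the label lists below are placeholders standing in for real test results.

```python
# A small sketch of quantifying model predictions against held-out labeled data.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]   # ground-truth labels from the test set (placeholder)
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions on the same examples (placeholder)

print("accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
```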

Identifying Data Labeling Types and Methods

Before designing and building a data labeling architecture, all of the data types that will be labeled must be identified. Data can come in many different forms, including text, images, video, and audio. Each data type comes with its own unique challenges, requiring a distinct approach for accurate and consistent labeling. Additionally, some data labeling software includes annotation tools geared toward specific data types. Many annotators and annotation teams also specialize in labeling certain data types. The choice of software and team will depend on the project.

For example, the data labeling process for computer vision might include categorizing digital images and videos, and creating bounding boxes to annotate the objects within them. Waymo’s Open Dataset is a publicly available example of a labeled computer vision dataset for autonomous driving; it was labeled by a combination of private and crowdsourced data labelers. Other applications for computer vision include medical imaging, surveillance and security, and augmented reality.

The text analyzed and processed by natural language processing (NLP) algorithms can be labeled in a variety of different ways, including sentiment analysis (identifying positive or negative emotions), keyword extraction (finding relevant phrases), and named entity recognition (pointing out specific people or places). Text blurbs can also be classified; examples include determining whether or not an email is spam or identifying the language of the text. NLP models can be used in applications such as chatbots, coding assistants, translators, and search engines.
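
As a concrete example of named entity recognition, the sketch below uses spaCy’s pretrained small English pipeline to suggest entity labels; it assumes the en_core_web_sm model has been downloaded separately.

```python
# A hedged NER sketch with spaCy. Requires the model to be installed first:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ada Lovelace met Charles Babbage in London in 1833.")

# Each entity span comes with a label such as PERSON, GPE (location), or DATE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```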

Text annotation with Doccano: names, times, and locations are labeled in different colors.

Audio data is used in a variety of applications, including sound classification, voice recognition, speech recognition, and acoustic analysis. Audio files might be annotated to identify specific words or phrases (like “Hey Siri”), classify different types of sounds, or transcribe spoken words into written text.

Many ML models are multimodal; in other words, they are capable of interpreting information from multiple sources simultaneously. A self-driving car might combine visual information, like traffic signs and pedestrians, with audio data, such as a honking horn. With multimodal data labeling, human annotators combine and label different types of data, capturing the relationships and interactions between them.

Another important consideration before building your system is the suitable data labeling method for your use case. Data labeling has traditionally been performed by human annotators; however, advancements in ML are increasing the potential for automation, making the process more efficient and affordable. Although the accuracy of automated labeling tools is improving, they still cannot match the accuracy and reliability that human labelers provide.

Hybrid or human-in-the-loop (HITL) data labeling combines the strengths of human annotators and software. With HITL data labeling, AI is used to automate the initial creation of the labels, after which the results are validated and corrected by human annotators. The corrected annotations are added to the training dataset and used to improve the performance of the software. The HITL approach offers efficiency and scalability while maintaining accuracy and consistency, and it is currently the most popular method of data labeling.
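
The loop below is a schematic sketch of that HITL cycle; the model interface, review function, and retraining call are hypothetical stand-ins for your own components rather than any particular platform’s API.

```python
# A schematic human-in-the-loop (HITL) labeling round.
CONFIDENCE_THRESHOLD = 0.9  # accept machine labels only above this confidence

def request_human_review(item, suggested_label):
    # Placeholder: in practice this enqueues the item in an annotation tool.
    return suggested_label

def hitl_round(model, unlabeled_batch, training_set):
    for item in unlabeled_batch:
        label, confidence = model.predict_with_confidence(item)  # assumed interface
        if confidence >= CONFIDENCE_THRESHOLD:
            training_set.append((item, label))          # accept the pre-label
        else:
            corrected = request_human_review(item, suggested_label=label)
            training_set.append((item, corrected))      # human-corrected label
    model.retrain(training_set)  # corrected data improves future pre-labels
    return training_set
```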

Choosing the Components of a Data Labeling System

When designing a data labeling architecture, the right tools are key to making sure that the annotation workflow is efficient and reliable. There are a variety of tools and platforms designed to optimize the data labeling process, but based on your project’s requirements, you may find that building a data labeling pipeline with in-house tools is the most appropriate for your needs.

Core Steps in a Data Labeling Workflow

The labeling pipeline begins with data collection and storage. Information can be gathered manually through techniques such as interviews, surveys, or questionnaires, or collected in an automated manner via web scraping. If you don’t have the resources to collect data at scale, open-source datasets from platforms such as Kaggle, UCI Machine Learning Repository, Google Dataset Search, and GitHub are a good alternative. Additionally, synthetic data can be generated using mathematical models to augment real-world data. To store data, cloud platforms such as Amazon S3, Google Cloud Storage, or Microsoft Azure Blob Storage scale with your needs, providing virtually unlimited storage capacity and built-in security features. However, if you are working with highly sensitive data subject to regulatory compliance requirements, on-premises storage is typically required.
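
As a small example of the storage step, this boto3 sketch uploads a collected raw data file to Amazon S3; the bucket name and file paths are placeholders, and AWS credentials are assumed to be configured through the usual mechanisms (environment variables or ~/.aws).

```python
# A minimal sketch of uploading collected raw data to Amazon S3 with boto3.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="raw_data/session_logs.jsonl",  # local file collected earlier (placeholder)
    Bucket="my-labeling-project-raw-data",   # hypothetical bucket name
    Key="raw/session_logs.jsonl",            # object key in the bucket
)
```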

Once the data is collected, the labeling process can begin. The annotation workflow can vary depending on data types, but in general, each significant data point is identified and classified using an HITL approach. There are a variety of platforms available that streamline this complex process, including both open-source (Doccano, Label Studio, CVAT) and commercial (Scale Data Engine, Labelbox, Supervisely, Amazon SageMaker Ground Truth) annotation tools.
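
Model-generated pre-labels are typically handed to these tools as structured import files. The sketch below writes suggestions to a simple JSONL file; the schema is illustrative only, since each platform (Label Studio, Doccano, and so on) defines its own import format.

```python
# A sketch of exporting model pre-labels for import into an annotation tool.
# The field names below are hypothetical; consult your tool's documentation.
import json

pre_labels = [
    {"text": "The package arrived two weeks late.",
     "suggested_label": "negative", "confidence": 0.93},
    {"text": "Great support team, very responsive.",
     "suggested_label": "positive", "confidence": 0.97},
]

with open("pre_labels.jsonl", "w") as f:
    for task in pre_labels:
        f.write(json.dumps(task) + "\n")  # one pre-labeled task per line
```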

After the labels are created, they are reviewed by a QA team to ensure accuracy. Any inconsistencies are typically resolved at this stage through manual approaches, such as majority decision, benchmarking, and consultation with subject matter experts. Inconsistencies can also be mitigated with automated methods, for example, using a statistical algorithm like the Dawid-Skene model to aggregate labels from multiple annotators into a single, more reliable label. Once the correct labels are agreed upon by the key stakeholders, they are referred to as the “ground truth,” and can be used to train ML models. Many free and open-source tools have basic QA workflow and data validation functionality, while commercial tools provide more advanced features, such as machine validation, approval workflow management, and quality metrics tracking.
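
As a baseline for the automated route, the snippet below aggregates labels from multiple annotators by simple majority vote and flags items without a clear majority for manual review; more sophisticated approaches such as the Dawid-Skene model instead weight each annotator by estimated reliability.

```python
# Majority-vote label aggregation; the annotations dictionary is illustrative.
from collections import Counter

annotations = {
    "item_001": ["cat", "cat", "dog"],
    "item_002": ["dog", "dog", "dog"],
}

ground_truth = {}
for item_id, labels in annotations.items():
    winner, count = Counter(labels).most_common(1)[0]
    # Flag items without a clear majority for review instead of guessing.
    ground_truth[item_id] = winner if count > len(labels) / 2 else "NEEDS_REVIEW"

print(ground_truth)
```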

Data Labeling Tool Comparison

Open-source tools are a good starting point for data labeling. Their functionality may be limited compared to commercial tools, but the absence of licensing fees is a significant advantage for smaller projects. And although commercial tools often feature AI-assisted pre-labeling, many open-source tools also support pre-labeling when connected to an external ML model.

Label Studio Community Edition
  • Supported data types: Text, Image, Audio, Video, Multidomain, Time-series
  • Workflow management: Yes
  • QA: No
  • Cloud storage support: Amazon S3, Google Cloud Storage, Azure Blob Storage

CVAT
  • Supported data types: Image, Video
  • Workflow management: Yes
  • QA: Yes
  • Cloud storage support: Amazon S3, Google Cloud Storage, Azure Blob Storage
  • Additional notes: Supports LiDAR and 3D cuboid annotation, as well as skeleton annotation for pose estimation; a free online version is available at app.cvat.ai

Doccano
  • Supported data types: Text, Image, Audio
  • Workflow management: Yes
  • QA: No
  • Cloud storage support: Amazon S3, Google Cloud Storage
  • Additional notes: Designed for text annotation; supports multiple languages and emojis

VIA (VGG Image Annotator)
  • Supported data types: Image, Audio, Video
  • Workflow management: No
  • QA: No
  • Cloud storage support: No
  • Additional notes: Browser-based; supports remotely hosted images

One additional browser-based, image-only tool in this comparison offers no workflow management, QA, or cloud storage support.

While open-source platforms provide much of the functionality needed for a data labeling project, complex machine learning projects requiring advanced annotation features, automation, and scalability will benefit from the use of a commercial platform. With added security features, technical support, comprehensive pre-labeling functionality (assisted by included ML models), and dashboards for visualizing analytics, a commercial data labeling platform is in most cases well worth the additional cost.

Labelbox
  • Supported data types: Text, Image, Audio, Video, HTML
  • Workflow management: Yes
  • QA: Yes
  • Cloud storage support: Amazon S3, Google Cloud Storage, Azure Blob Storage
  • Additional notes: Professional labeling teams, including those with specialized domain expertise, are available through Labelbox’s Boost service

Supervisely
  • Supported data types: Image, Video, 3D sensor fusion, DICOM
  • Workflow management: Yes
  • QA: Yes
  • Cloud storage support: Amazon S3, Google Cloud Storage, Azure Blob Storage
  • Additional notes: Open ecosystem with hundreds of apps built on Supervisely’s App Engine; supports LiDAR and RADAR, as well as multislice medical imaging

Amazon SageMaker Ground Truth
  • Supported data types: Text, Image, Video, 3D sensor fusion
  • Workflow management: Yes
  • QA: Yes
  • Cloud storage support: Amazon S3
  • Additional notes: Data labelers and reviewers provided through the Amazon Mechanical Turk workforce

Scale AI Data Engine
  • Supported data types: Text, Image, Audio, Video, 3D sensor fusion, Maps
  • Workflow management: Yes
  • QA: Yes
  • Cloud storage support: Amazon S3, Google Cloud Storage, Azure Blob Storage

One additional commercial platform in this comparison supports Text, Image, Audio, Video, HTML, and PDF data; provides workflow management and QA; integrates with Amazon S3, Google Cloud Storage, and Azure Blob Storage; and offers multilingual annotation teams, including those with domain expertise, through WForce.

If you require features that are not available with existing tools, you may opt to build an in-house data labeling platform, enabling you to customize support for specific data formats and annotation tasks, as well as design custom pre-labeling, review, and QA workflows. However, building and maintaining a platform that is on par with the functionalities of a commercial platform is cost prohibitive for most companies.

Ultimately, the choice depends on various factors. If third-party platforms do not have the features that the project requires or if the project involves highly sensitive data, a custom-built platform might be the best solution. Some projects may benefit from a hybrid approach, where core labeling tasks are handled by a commercial platform, but custom functionality is developed in-house.

Ensuring Quality and Security in Data Labeling Systems

The data labeling pipeline is a complex system that involves massive amounts of data, several levels of infrastructure, a team of labelers, and an elaborate, multilayered workflow. Bringing these components together into a smoothly running system is not a trivial task. There are challenges that can affect labeling quality, reliability, and efficiency, as well as the ever-present issues of privacy and security.

Improving Accuracy in Labeling

Automation can speed up the labeling process, but overdependence on automated labeling tools can reduce the accuracy of labels. Data labeling tasks typically require contextual awareness, domain expertise, or subjective judgment, none of which a software algorithm can yet provide. Providing clear human annotation guidelines and detecting labeling errors are two effective methods for ensuring data labeling quality.

Inaccuracies in the annotation process can be minimized by creating a comprehensive set of guidelines. All potential label classifications should be defined and their formats specified. The guidelines should provide step-by-step instructions, including guidance for ambiguity and edge cases, along with a variety of example annotations for labelers to follow that cover straightforward data points as well as ambiguous ones.

A hybrid labeling workflow: an unlabeled dataset is labeled via AI-assisted pre-labeling, labeling by multiple annotators, consensus on the labels, and QA, with the labeled data used for further training.

Having more than one independent annotator labeling the same data point and comparing their results will yield a higher degree of accuracy. Inter-annotator agreement (IAA) is a key metric used to measure labeling consistency between annotators. For data points with low IAA scores, a review process should be established in order to reach consensus on a label. Setting a minimum consensus threshold for IAA scores ensures that the ML model only learns from data with a high degree of agreement between labelers.
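
For two annotators, Cohen’s kappa is a common IAA measure. The sketch below computes it with scikit-learn on placeholder labels and applies an assumed consensus threshold.

```python
# Measuring inter-annotator agreement with Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["spam", "ham", "spam", "spam", "ham"]  # placeholder labels
annotator_b = ["spam", "ham", "ham", "spam", "ham"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Items or batches below a chosen threshold are routed to a consensus review.
MIN_KAPPA = 0.8  # assumed project-specific threshold
if kappa < MIN_KAPPA:
    print("Agreement below threshold; trigger a consensus review.")
```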

In addition, rigorous error detection and tracking go a long way in improving annotation accuracy. Error detection can be automated using software tools like Cleanlab. With such tools, labeled data can be compared against predefined rules to detect inconsistencies or outliers. For images, the software might flag overlapping bounding boxes. With text, missing annotations or incorrect label formats can be automatically detected. All errors are highlighted for review by the QA team. Also, many commercial annotation platforms offer AI-assisted error detection, where potential mistakes are flagged by an ML model pretrained on annotated data. Flagged and reviewed data points are then added to the model’s training data, improving its accuracy via active learning.
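
A rule-based check of the kind described above can be quite simple. The sketch below flags pairs of near-duplicate bounding boxes by intersection over union (IoU); the boxes and the 0.9 threshold are illustrative assumptions.

```python
# Flag overlapping bounding boxes that likely indicate annotation errors.
# Boxes are (x_min, y_min, x_max, y_max) in pixel coordinates.
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

boxes = [(10, 10, 50, 50), (12, 11, 52, 49), (100, 100, 150, 150)]  # placeholders
flagged = [
    (i, j)
    for i in range(len(boxes))
    for j in range(i + 1, len(boxes))
    if iou(boxes[i], boxes[j]) > 0.9  # near-duplicate boxes go to the QA team
]
print("Pairs to review:", flagged)
```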

Error tracking provides the valuable feedback necessary to improve the labeling process via continuous learning. Key metrics, such as label accuracy and consistency between labelers, are tracked. If there are tasks where labelers frequently make mistakes, the underlying causes need to be determined. Many commercial data labeling platforms provide built-in dashboards that enable labeling history and error distribution to be visualized. Methods of improving performance can include adjusting data labeling standards and guidelines to clarify ambiguous instructions, retraining labelers, or refining the rules for error detection algorithms.

Addressing Bias and Fairness

Data labeling relies heavily on personal judgment and interpretation, making it a challenge for human annotators to create fair and unbiased labels. Data can be ambiguous. When classifying text data, sentiments such as sarcasm or humor can easily be misinterpreted. A facial expression in an image might be considered “sad” to some labelers and “bored” to others. This subjectivity can open the door to bias.

The dataset itself can also be biased. Depending on the source, specific demographics and viewpoints can be over- or underrepresented. Training a model on biased data can cause inaccurate predictions, for example, incorrect diagnoses due to bias in medical datasets.

To reduce bias in the annotation process, the members of the labeling and QA teams should have diverse backgrounds and perspectives. Double- and multilabeling can also minimize the impact of individual biases. The training data should reflect real-world data, with a balanced representation of factors such as demographics and geographic location. Data can be collected from a wider range of sources, and if necessary, data can be added to specifically address potential sources of bias. In addition, data augmentation techniques, such as image flipping or text paraphrasing, can minimize inherent biases by artificially increasing the diversity of the dataset. These methods present variations on the original data point. Flipping an image enables the model to learn to recognize an object regardless of the way it is facing, reducing bias toward specific orientations. Paraphrasing text exposes the model to additional ways of expressing the information in the data point, reducing potential biases caused by specific words or phrasing.
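
For example, a horizontal flip is a one-line augmentation with Pillow; the file paths below are placeholders.

```python
# Minimal augmentation sketch: add a mirrored variant of a training image.
from PIL import Image, ImageOps

image = Image.open("samples/stop_sign.jpg")      # hypothetical training image
flipped = ImageOps.mirror(image)                  # left-right flip
flipped.save("samples/stop_sign_flipped.jpg")     # add the variant to the dataset
```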

Incorporating an external oversight process can also help to reduce bias in the data labeling process. An external team—consisting of domain experts, data scientists, ML experts, and diversity and inclusion specialists—can be brought in to review labeling guidelines, evaluate workflow, and audit the labeled data, providing recommendations on how to improve the process so that it is fair and unbiased.

Data Privacy and Security

Data labeling projects often involve potentially sensitive information. All platforms should integrate security features such as encryption and multifactor authentication for user access control. To protect privacy, data with personally identifiable information should be removed or anonymized. Additionally, every member of the labeling team should be trained on data security best practices, such as having strong passwords and avoiding accidental data sharing.
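
As a minimal illustration, the regex-based sketch below redacts obvious email addresses and phone-like numbers before text reaches annotators; production pipelines would rely on dedicated PII-detection tooling rather than hand-rolled patterns.

```python
# Simple regex-based redaction of emails and phone-like numbers (illustrative only).
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(redact("Contact Jane at jane.doe@example.com or +1 (555) 123-4567."))
```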

Data labeling platforms should also comply with relevant data privacy regulations, including the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), as well as the Health Insurance Portability and Accountability Act (HIPAA). Many commercial data platforms are SOC 2 Type 2 certified, meaning they have been audited by an external party and found to comply with the five trust principles: security, availability, processing integrity, confidentiality, and privacy.

Future-proofing Your Data Labeling System

Data labeling is an invisible but massive undertaking that plays a pivotal role in the development of ML models and AI systems—and labeling architecture must be able to scale as requirements change.

Commercial and open-source platforms are regularly updated to support emerging data labeling needs. Likewise, in-house data labeling solutions should be developed with easy updating in mind. For example, a modular design enables components to be swapped out without affecting the rest of the system. And integrating open-source libraries or frameworks adds adaptability, because they are constantly updated as the industry evolves.

In particular, cloud-based solutions offer significant advantages for large-scale data labeling projects over self-managed systems. Cloud platforms can dynamically scale their storage and processing power as needed, eliminating the need for expensive infrastructure upgrades.

The annotating workforce must also be able to scale as datasets grow. New annotators need to be trained quickly on how to label data accurately and efficiently. Filling the gaps with managed data labeling services or on-demand annotators allows for flexible scaling based on project needs. That said, the training and onboarding process must also be scalable with respect to location, language, and availability.

The key to ML model accuracy is the quality of the labeled data the models are trained on. Effective, hybrid data labeling systems deliver that quality, giving AI the potential to improve the way we do things and make virtually every business more efficient.

Understanding the basics

  • What is data labeling in AI?

    Data labeling, or data annotation, is the practice of adding descriptive labels to raw data, giving each of the data points context and meaning. The labeled data can then be used to more effectively train machine learning (ML) models. Annotations can be added via automated or manual labeling.

  • Why is data labeling important?

    Labels provide context and meaning to raw data, enabling machine learning models to learn associations more effectively. Without labels, a model’s interpretation of data might be inaccurate, potentially leading to incorrect predictions and classifications on unseen data.

  • How do you quickly label images for machine learning?

    There are several methods to build an efficient data labeling process. Many platforms offer batch labeling and automated pre-labeling. In addition, clear annotation guidelines can make the workflow more efficient. Expanding your labeling team via outsourcing is also an effective way to label large volumes quickly.

  • What is a data labeling example?

    Object detection and classification in images is a common application of data labeling. On the raw data (an image), bounding boxes are drawn around important objects, and each object is classified and labeled. Image annotation is essential in training ML models for self-driving cars.

  • How do I start data labeling?

    Start by defining your requirements based on the tasks you want the ML model to perform. Next, collect data via manual or automated techniques or open-source datasets. Finally, choose your data labeling platform and build your annotation team; you may outsource via specialized annotation providers.

  • What is labeled and unlabeled data in AI?

    Labeled data is typically used in supervised training, where ML models learn from examples. Unlabeled data lacks the context provided by the labels. It is generally used for tasks like anomaly detection and clustering, where models are trained via unsupervised learning.
