Building the Audino Annotation Tool with CVAT’s HUMAN-Powered Backend

By Rohan Kumar on 08/08/2025


1. The Strategy: Starting with the Workforce

Our goal was more than just building another annotation tool. We envisioned a complete, scalable audio data labeling ecosystem. Any such ecosystem has two fundamental pillars: the software (the tool) and the people (the workforce). Most projects make the mistake of focusing only on the software, leaving the critical question of "Who will do the work?" as an afterthought.

We decided to solve the people problem first.

Our strategy was built around HUMAN Protocol, a decentralized framework for orchestrating a global, on-demand workforce. By committing to HUMAN Protocol from day one, we knew we could solve scalability, payment, and quality control for our annotators.

This strategic decision immediately defined our primary technical requirement: we needed a backend that was natively built to leverage HUMAN Protocol.

What We Needed in a Backend

With HUMAN Protocol as our cornerstone, we established a clear set of non-negotiable requirements for our backend system:

  • Native HUMAN Protocol Integration: This was the #1 priority. We needed a system where the connection to the workforce was a core feature, not a hacky add-on.

  • Proven, Enterprise-Grade Stability: The system had to be robust, scalable, and battle-tested. We couldn't risk building our vision on a fragile or unproven platform.

  • Modular and Decoupled Architecture: Since we were building a tool for audio, we needed the flexibility to replace the default frontend (which is typically for images/video) without breaking the entire system.

  • Rich Core Functionality: It had to come with all the essential features out of the box: user management, project orchestration, secure file handling, and a comprehensive API.

Why CVAT

When we evaluated the landscape against these strict requirements, one clear winner emerged: the Computer Vision Annotation Tool (CVAT) backend.

CVAT was the perfect fit because it was the very first major annotation platform built on top of HUMAN Protocol. It met every one of our criteria perfectly:

  • It offered the deepest, most mature HUMAN Protocol integration available.

  • Its backend is a proven, enterprise-grade system, trusted by countless organizations for large-scale labeling operations.

  • Its architecture is famously decoupled, allowing us to use its powerful backend while developing our own custom frontend for audio.

  • It provided all the core functionalities we needed, saving us thousands of hours of development time.

The path forward was clear. We would use the CVAT backend to power our entire infrastructure, from data storage to its seamless workforce integration, and focus our energy on building the best possible audio annotation interface. This guide outlines the steps of that journey.

2. Understanding the CVAT Backend Architecture

Before we start, let's briefly look at the high-level architecture of CVAT. This will help us understand which parts we will be interacting with.

The CVAT system comprises several microservices, typically orchestrated using Docker.

  • Django Server: The heart of the backend. It exposes a comprehensive REST API for all operations, including user authentication, project management, and annotation data handling.

  • Database (PostgreSQL): Stores all metadata about your projects, tasks, labels, and users.

  • Task Queue (Redis/Celery): Handles asynchronous and long-running operations like importing large datasets or exporting annotations.

  • File Storage: A local file system or an object storage service like MinIO where your raw data and annotation files are saved.

Our custom frontend communicates primarily with the Django server's REST API, which in turn manages all logic related to the database, task queue, and file storage. To support our integration, we also made modifications to the backend itself where audio support required it.

3. The Integration Blueprint: A Step-by-Step Guide

Step 1: Customizing the CVAT Backend

The first step is to get the CVAT backend running without its default frontend, which we achieved by modifying the standard docker-compose.yml file so that only the backend services are started. With the backend running headless, we then made three sets of changes:

1. Updating the Data Models:
To accommodate audio-specific requirements, we extended CVAT’s core models (a minimal sketch follows this list):

  • Added fields like audio_duration, audio_file_index, and time_stamps to store essential metadata associated with each audio file.

  • Introduced validation logic to ensure these fields are correctly populated during upload and maintained throughout the task lifecycle.

  • Created a new model specifically designed to store automatically generated annotations, enabling a seamless experience for job owners by pre-populating segments based on speech detection or transcription.
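
For illustration, here is a minimal Django sketch of what these model extensions could look like. Only the field names (audio_duration, audio_file_index, time_stamps) and the idea of a separate auto-annotation model come from the description above; the class names, relations, and validation rules are assumptions made for this post, not the actual Audino source.

```python
# Minimal Django sketch; class names, relations, and validation are assumptions
# made for illustration, not the actual Audino/CVAT source.
from django.core.exceptions import ValidationError
from django.db import models


class AudioData(models.Model):
    """Audio-specific metadata attached to a CVAT task."""
    task = models.ForeignKey('engine.Task', on_delete=models.CASCADE,
                             related_name='audio_data')  # CVAT's Task model lives in the engine app
    audio_duration = models.FloatField(help_text='Duration of the clip in seconds')
    audio_file_index = models.PositiveIntegerField(help_text='Position of the file within the task')
    time_stamps = models.JSONField(default=list, help_text='Per-chunk start/end offsets in seconds')

    def clean(self):
        # Validation run at upload time and kept consistent through the task lifecycle.
        if self.audio_duration <= 0:
            raise ValidationError('audio_duration must be positive')
        for ts in self.time_stamps:
            if ts.get('start', 0) > ts.get('end', 0):
                raise ValidationError('each time stamp needs start <= end')


class AutoAnnotation(models.Model):
    """Automatically generated segments (e.g. from VAD) that pre-populate a job."""
    task = models.ForeignKey('engine.Task', on_delete=models.CASCADE,
                             related_name='auto_annotations')
    start_time = models.FloatField()
    end_time = models.FloatField()
    transcript = models.TextField(blank=True)
```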

2. Customizing the API Endpoints:
We adjusted key API endpoints to ensure they work smoothly with audio data and annotation workflows (see the serializer sketch after this list):

  • Modified the data upload API to support audio file formats (e.g., .wav, .mp3) and to extract audio metadata during ingestion.

  • Extended the annotation APIs to handle temporal annotations using start_time and end_time, replacing traditional image-based bounding box structures.

  • Ensured proper serialization/deserialization of the new fields so that frontend tools can send and receive annotation data in the correct format without breaking CVAT’s existing infrastructure.
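
As a rough illustration of the serialization side, the sketch below shows a Django REST Framework serializer for a temporal segment. The serializer name, field set, and attribute handling are assumptions; the key point is that [start_time, end_time] replaces the point list of an image bounding box.

```python
# Sketch of a DRF serializer for a temporal segment; the class name, field set,
# and attribute handling are illustrative assumptions.
from rest_framework import serializers


class AudioShapeSerializer(serializers.Serializer):
    """A labelled waveform segment: [start_time, end_time] instead of box points."""
    id = serializers.IntegerField(required=False, allow_null=True)
    label_id = serializers.IntegerField()
    start_time = serializers.FloatField(min_value=0)
    end_time = serializers.FloatField(min_value=0)
    transcript = serializers.CharField(required=False, allow_blank=True)
    attributes = serializers.DictField(child=serializers.CharField(), required=False)

    def validate(self, data):
        # Keep segments well-formed so exports and the waveform UI stay consistent.
        if data['end_time'] < data['start_time']:
            raise serializers.ValidationError('end_time must be >= start_time')
        return data
```

The same structure is returned on read, so the frontend can place segments on the waveform directly from start_time and end_time.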

3. Creating an Audio Data Ingestion Pipeline:
To optimize the handling of audio data and reduce manual annotation effort, we built a robust ingestion pipeline (sketched in code after this list):

  • Enhanced CVAT’s Celery task queue to run a Voice Activity Detection (VAD) model during the upload process. This generates preliminary ground truth segments that help job owners kick-start the annotation process with minimal effort.

  • Implemented logic to automatically associate these pre-annotated segments with the corresponding task and user, storing them in our custom annotation model.

  • Added a server-side process to generate audio waveform peaks, enabling smooth and efficient waveform rendering on the frontend—particularly important for long audio files where client-side processing would be too slow or memory-intensive.
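
A condensed sketch of such an ingestion task is shown below. The energy-threshold VAD is a toy stand-in for the real speech-detection model, AutoAnnotation refers to the custom model sketched earlier, and the import path and peak-file layout are assumptions.

```python
# Sketch of an ingestion-time Celery task. The energy-threshold VAD is a toy
# stand-in for the real speech-detection model; AutoAnnotation is the custom model
# sketched earlier, and the module path below is a placeholder.
import json
from pathlib import Path

import numpy as np
import soundfile as sf
from celery import shared_task

from annotations.models import AutoAnnotation  # placeholder import path


def detect_speech(samples: np.ndarray, sample_rate: int) -> list[tuple[float, float]]:
    """Toy energy-threshold VAD; production would call a real model here."""
    frame = int(0.03 * sample_rate)                                  # 30 ms frames
    energy = np.array([np.abs(samples[i:i + frame]).mean()
                       for i in range(0, len(samples), frame)])
    voiced = energy > 1.5 * energy.mean()
    segments, start = [], None
    for idx, flag in enumerate(voiced):
        if flag and start is None:
            start = idx
        elif not flag and start is not None:
            segments.append((start * frame / sample_rate, idx * frame / sample_rate))
            start = None
    if start is not None:
        segments.append((start * frame / sample_rate, len(samples) / sample_rate))
    return segments


@shared_task
def process_uploaded_audio(task_id: int, audio_path: str) -> None:
    samples, sample_rate = sf.read(audio_path)
    if samples.ndim > 1:
        samples = samples.mean(axis=1)                               # mix down to mono

    # 1. Pre-populate segments for the job owner from voice activity detection.
    for start, end in detect_speech(samples, sample_rate):
        AutoAnnotation.objects.create(task_id=task_id, start_time=start, end_time=end)

    # 2. Pre-compute waveform peaks server-side so long files render quickly in the
    #    browser without decoding the full audio client-side.
    bucket = max(1, len(samples) // 2000)                            # ~2000 peak values
    peaks = [float(np.abs(samples[i:i + bucket]).max())
             for i in range(0, len(samples), bucket)]
    Path(audio_path).with_suffix('.peaks.json').write_text(json.dumps(peaks))
```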

Step 2: Crafting the Frontend

Our custom frontend is designed to deliver a rich and intuitive user experience tailored for audio annotation. We’ve implemented several key features to support this:

  • Audio Visualization: We used Wavesurfer.js to render an interactive audio waveform, allowing users to visually explore and interact with the audio.

  • Playback Controls: Standard play, pause, and scrub functionality is provided for precise navigation.

  • Annotation Tools: Users can create temporal segments on the waveform, assign labels, and add rich annotations such as transcriptions in multiple languages, emotions, gender, locale, accent, and age—all with ease.

  • Data Import/Export: Users can import existing annotations or export their work in supported formats, enabling smooth collaboration and backup.

Step 3: Interfacing with the CVAT REST API

We integrated the CVAT REST APIs into our custom frontend to enable seamless interaction with the backend. This integration allows the frontend to fetch tasks, submit annotations, and manage user data efficiently. Using axios, we built a set of API clients that handle all necessary requests to and from the CVAT server, ensuring smooth data flow and real-time updates.
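
The production API clients are written with axios in the frontend; purely for illustration, and to stay consistent with the Python sketches above, the same request flow can be expressed as follows. The endpoints shown follow CVAT's public REST API; the deployment URL and the audio-specific annotation payload are assumptions based on the fields described earlier.

```python
# The production clients use axios; this Python sketch only illustrates the request
# flow. Endpoints follow CVAT's public REST API; the audio-specific annotation
# payload below is an assumption based on the fields described earlier.
import requests

BASE_URL = 'https://cvat.example.com/api'   # hypothetical deployment URL

session = requests.Session()
token = session.post(f'{BASE_URL}/auth/login',
                     json={'username': 'annotator', 'password': '...'}).json()['key']
session.headers['Authorization'] = f'Token {token}'

# Fetch the jobs visible to the authenticated user.
jobs = session.get(f'{BASE_URL}/jobs', params={'page_size': 20}).json()['results']

# Submit temporal annotations for the first job.
job_id = jobs[0]['id']
payload = {
    'shapes': [
        {'label_id': 7, 'start_time': 12.4, 'end_time': 15.9,
         'transcript': 'hello world', 'attributes': {}},
    ],
}
session.put(f'{BASE_URL}/jobs/{job_id}/annotations', json=payload)
```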

4. Key Considerations

  • Data Mapping: We defined how audio data maps to CVAT's structure. In our case, a "frame" corresponds to a segment or chunk of an audio file, while a "bounding box" is represented as a [start_time, end_time] pair to indicate the annotated region.

  • Annotation Format: We designed our annotation format to support export compatibility with a wide range of standard datasets commonly used for training speech and audio models. In addition to our native Audino Format, we support exporting to popular formats like Common Voice, LibriSpeech, VoxPopuli, TED-LIUM, VoxCeleb, VCTK Corpus, and LibriVox. This ensures that annotated data can be directly used for model training or seamlessly integrated into existing research pipelines.

  • Scalability: To handle large volumes of data, we leveraged CVAT’s built-in Celery queue for background processing. Tasks such as importing, exporting, and auto-processing annotations (AI Annotation) are queued through the API to avoid blocking the frontend and ensure scalable performance.

  • Annotation Validation: We implemented a custom validation function to evaluate the quality of user annotations across different task types, such as transcription, sentiment analysis, and emotion tagging. This validation not only ensures that annotations follow the expected format and label structure but also provides users with real-time feedback on the accuracy and completeness of their work. By highlighting inconsistencies or missing fields, the system helps users improve annotation quality and maintain dataset reliability. A simplified sketch of these checks follows this list.
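
Below is a simplified sketch of the kind of per-task-type checks described above. The label sets, field names, and rules are illustrative placeholders rather than the exact production validation.

```python
# Simplified sketch of per-task-type annotation checks; the label sets, field
# names, and rules are illustrative placeholders, not the production validation.
EMOTIONS = {'neutral', 'happy', 'sad', 'angry'}          # assumed label set
SENTIMENTS = {'positive', 'negative', 'neutral'}


def validate_annotation(task_type: str, segment: dict) -> list[str]:
    """Return human-readable issues so the UI can give annotators real-time feedback."""
    issues = []
    if segment.get('start_time', 0) >= segment.get('end_time', 0):
        issues.append('segment must have start_time < end_time')

    if task_type == 'transcription' and not segment.get('transcript', '').strip():
        issues.append('transcript is empty')
    elif task_type == 'emotion' and segment.get('emotion') not in EMOTIONS:
        issues.append('emotion label is missing or not in the allowed set')
    elif task_type == 'sentiment' and segment.get('sentiment') not in SENTIMENTS:
        issues.append('sentiment label is missing or not in the allowed set')

    return issues
```

Returning a list of issues rather than a single pass/fail flag makes it straightforward to surface this feedback directly in the annotation UI.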

5. Conclusion

Our journey to build Audino started with a clear strategy: to build an ecosystem around a scalable workforce powered by HUMAN Protocol. This strategy dictated our choice of technology, leading us directly to the CVAT backend, the only platform that offered the enterprise-grade stability and native HUMAN integration we required.

Leveraging CVAT's robust infrastructure made the development process far simpler and faster. We bypassed the immense challenge of building a scalable backend from scratch and were able to focus entirely on what we do best: creating a world-class user experience for audio annotation. This strategic, backend-first approach was the key to our success.

If you’re building in audio AI and need scalable annotation, we’d love to talk.

Explore our work at audino.in
Or get in touch to deploy large-scale audio intelligence solutions.