Unlock the Power of Indian Languages with Our Open-Source Multilingual Dataset

Explore audio-labeled and transcribed data in various languages, designed for AI models, product development, speech processing tools, language research, and other advanced applications.

Languages
10 +
Total Data Duration
4702.68 (hr) +
Total Transcription Duration
1161.0 (hr) +
Male Speakers
511 +
Female Speakers
435 +
Total Files
3,82,040 +

Audino Multilingual Data Collection

Audino’s multilingual data collection covers ten Indian languages. It includes raw and labeled audio data, and the table below reports total recording hours, transcription duration, and speaker distribution for each language. The dataset is suited to training speech recognition models, with both male and female speakers represented in every language.

Map of India showing multilingual data coverage
Language  | Raw Data | Labeled Data | Data Duration (hr) | Transcription (hr) | Male Speakers | Female Speakers | Files   | Metadata (hr)
Hindi     | Yes      | Yes          | 512.14 +           | 205.05 +           | 58            | 63              | 40623 + | _
Bengali   | Yes      | Yes          | 615.59 +           | 115.79 +           | 18            | 28              | 52462 + | 272.24 +
Gujarati  | No       | Yes          | 96.37 +            | 205.05 +           | 44            | 35              | 6202 +  | _
Kannada   | Yes      | Yes          | 349.21 +           | 165.79 +           | 53            | 26              | 25274 + | 112.08
Malayalam | Yes      | Yes          | 643.23 +           | 147.26 +           | 12            | 20              | 46925 + | 209.46 +
Marathi   | Yes      | Yes          | 402.62 +           | 185.19 +           | 82            | 61              | 34369 + | 155.75 +
Odia      | Yes      | Yes          | 430.58 +           | _                  | 10            | 32              | 41845 + | _
Punjabi   | Yes      | Yes          | 372.14 +           | 136.87 +           | 65            | 77              | 35083 + | 145.98 +
Tamil     | No       | Yes          | 714.24 +           | _                  | 116           | 42              | 61573 + | _
Telugu    | Yes      | Yes          | 566.56 +           | _                  | 53            | 51              | 37684 + | _
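
The per-language figures in the table above can be cross-checked against the headline totals (10 languages, 4702.68 hours, 511 male and 435 female speakers, 3,82,040 files). The following is a minimal Python sketch of that arithmetic. The tuple layout and the function names `summarize` and `female_share` are illustrative assumptions, not part of any Audino tooling, and the "+" suffixes from the table are dropped.

```python
# Rows read from the table above (assumed layout, for illustration only):
# (language, data duration in hours, male speakers, female speakers, files)
ROWS = [
    ("Hindi", 512.14, 58, 63, 40623),
    ("Bengali", 615.59, 18, 28, 52462),
    ("Gujarati", 96.37, 44, 35, 6202),
    ("Kannada", 349.21, 53, 26, 25274),
    ("Malayalam", 643.23, 12, 20, 46925),
    ("Marathi", 402.62, 82, 61, 34369),
    ("Odia", 430.58, 10, 32, 41845),
    ("Punjabi", 372.14, 65, 77, 35083),
    ("Tamil", 714.24, 116, 42, 61573),
    ("Telugu", 566.56, 53, 51, 37684),
]


def summarize(rows):
    """Add up the per-language columns into dataset-wide totals."""
    return {
        "languages": len(rows),
        "hours": round(sum(r[1] for r in rows), 2),
        "male": sum(r[2] for r in rows),
        "female": sum(r[3] for r in rows),
        "files": sum(r[4] for r in rows),
    }


def female_share(rows):
    """Fraction of speakers who are female, per language (2 decimals)."""
    return {lang: round(f / (m + f), 2) for lang, _, m, f, _ in rows}


if __name__ == "__main__":
    print(summarize(ROWS))
    for lang, share in female_share(ROWS).items():
        print(f"{lang}: {share:.2f} female")
```

The summed columns reproduce the headline figures, and the per-language share shows that the male-to-female ratio varies noticeably from one language to another, which may be worth accounting for when sampling training data.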

Acknowledgments & Attribution

We extend our sincere gratitude to all contributors, researchers, and organizations who have made this dataset possible.

Contributors

Research Institutions

  • National University of Singapore (NUS)
  • University at Buffalo (UB)
  • National Institute of Informatics (NII)

Language Experts

  • Dr. Rajiv Ratn
  • Dr. Rushali
  • Dr. Yaman K Singla

Data Collection Teams

  • North India Data Collection Team
  • South India Data Collection Team
  • East India Data Collection Team

Open Source Community

  • Mozilla Common Voice
  • OpenSLR
  • Indian Language Technology Community

Data Sources

Public Domain Resources

  • Indian Government Public Datasets
  • Public Domain Audio Collections
  • Open Educational Resources

Licensed Datasets

  • Commercial Speech Databases
  • Research Institution Datasets
  • Educational Institution Collections

Community Contributions

  • Crowdsourced Audio Recordings
  • Volunteer Transcriptions
  • Community Language Projects

Research Collaborations

  • Academic Research Projects
  • Government Research Initiatives
  • Industry-Academia Partnerships

Get in Touch

For more information or inquiries about our dataset, feel free to reach out through our Contact page.