Unlock the Power of Indian Languages with Our Open-Source Multilingual Dataset

Explore audio-labeled and transcribed data in various languages, designed for AI models, product development, speech processing tools, language research, and other advanced applications.

Languages: 10 +
Total Data Duration: 4702.68 (hr) +
Total Transcription Duration: 1161.0 (hr) +
Male Speakers: 511 +
Female Speakers: 435 +
Total Files: 3,82,040 +

Audino Multilingual Data Collection

Audino’s multilingual data collection is a speech resource covering ten Indian languages. It includes raw and labeled audio data, and for each language it reports total recording hours, transcription duration, speaker counts, and file counts. The dataset is suited to training speech recognition models, and every language includes both male and female speakers, although the ratio varies from language to language.

Map of India showing multilingual data collection coverage

Languages	Raw Data	Labeled Data	Data Duration (Hour)	Transcription (Hour)	Male Speakers	Female Speakers	Files	Metadata (Hour)
Hindi	Yes	Yes	512.14 +	205.05 +	58	63	40623 +	_
Bengali	Yes	Yes	615.59 +	115.79 +	18	28	52462 +	272.24 +
Gujarati	No	Yes	96.37 +	205.05 +	44	35	6202 +	_
Kannada	Yes	Yes	349.21 +	165.79 +	53	26	25274 +	112.08
Malayalam	Yes	Yes	643.23 +	147.26 +	12	20	46925 +	209.46 +
Marathi	Yes	Yes	402.62 +	185.19 +	82	61	34369 +	155.75 +
Odia	Yes	Yes	430.58 +	_	10	32	41845 +	_
Punjabi	Yes	Yes	372.14 +	136.87 +	65	77	35083 +	145.98 +
Tamil	No	Yes	714.24 +	_	116	42	61573 +	_
Telugu	Yes	Yes	566.56 +	_	53	51	37684 +	_
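
As a quick sanity check, the per-language figures in the table can be tallied and compared with the headline totals. The following is a minimal sketch in Python: the `ROWS` list and the `summarize` and `female_share` helpers are illustrative names, not part of any Audino tooling, and the values are copied by hand from the table rather than loaded from the dataset itself.

```python
# Per-language figures copied by hand from the table:
# (language, data_hours, male_speakers, female_speakers, files)
ROWS = [
    ("Hindi", 512.14, 58, 63, 40623),
    ("Bengali", 615.59, 18, 28, 52462),
    ("Gujarati", 96.37, 44, 35, 6202),
    ("Kannada", 349.21, 53, 26, 25274),
    ("Malayalam", 643.23, 12, 20, 46925),
    ("Marathi", 402.62, 82, 61, 34369),
    ("Odia", 430.58, 10, 32, 41845),
    ("Punjabi", 372.14, 65, 77, 35083),
    ("Tamil", 714.24, 116, 42, 61573),
    ("Telugu", 566.56, 53, 51, 37684),
]


def summarize(rows):
    """Add up hours, speakers, and files across all languages."""
    return {
        "languages": len(rows),
        "hours": round(sum(r[1] for r in rows), 2),
        "male": sum(r[2] for r in rows),
        "female": sum(r[3] for r in rows),
        "files": sum(r[4] for r in rows),
    }


def female_share(rows):
    """Fraction of female speakers per language, rounded to two decimals."""
    return {lang: round(f / (m + f), 2) for lang, _, m, f, _ in rows}


if __name__ == "__main__":
    print(summarize(ROWS))
    # List languages from the lowest to the highest share of female speakers.
    for lang, share in sorted(female_share(ROWS).items(), key=lambda kv: kv[1]):
        print(f"{lang}: {share:.2f}")
```

The per-language shares make the speaker distribution easy to inspect before choosing which languages to train on or how to weight them.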

Acknowledgments & Attribution

We extend our sincere gratitude to all contributors, researchers, and organizations who have made this dataset possible.

Contributors

Research Institutions

National University of Singapore (NUS)
University at Buffalo (UB)
National Institute of Informatics (NII)

Language Experts

Dr. Rajiv Ratn
Dr. Rushali
Dr. Yaman K Singla

Data Collection Teams

North India Data Collection Team
South India Data Collection Team
East India Data Collection Team

Open Source Community

Mozilla Common Voice
OpenSLR
Indian Language Technology Community

Data Sources

Public Domain Resources

Indian Government Public Datasets
Public Domain Audio Collections
Open Educational Resources

Licensed Datasets

Commercial Speech Databases
Research Institution Datasets
Educational Institution Collections

Community Contributions

Crowdsourced Audio Recordings
Volunteer Transcriptions
Community Language Projects

Research Collaborations

Academic Research Projects
Government Research Initiatives
Industry-Academia Partnerships

Get in Touch

For more information or inquiries about our dataset, visit our Contact page to connect with us directly.
