Nova-3 Expands Speech-to-Text Capabilities With Support for Thai, Cantonese, Mandarin, and Indic Languages
For most of the history of commercial speech recognition, language support lists were dominated by European languages, with English comfortably at the top and everything else trailing at varying distances. The practical consequence was simple and significant: roughly half the world's population, whose mother tongues include Mandarin, Hindi, Bengali, Tamil, Cantonese, and Thai, could not access the quality of ASR that English speakers took for granted. Building a voice AI application for a Thai-speaking user base or a Cantonese customer service line meant accepting substantially higher error rates or routing through expensive, lower-quality workarounds.
Deepgram's Nova-3 model represents a meaningful step toward closing that gap. With expanded language support now including Thai, Cantonese, Mandarin, and a growing suite of Indic languages, Nova-3 brings the accuracy and speed that made its predecessors competitive for English into markets where the demand for capable, local-language ASR has long outpaced the available supply. This is not a minor update to a language list. It is a deliberate architectural and data investment targeting languages that represent hundreds of millions of speakers and enormous untapped market potential for voice AI applications.
This article explains what the Nova-3 expansion covers, why these specific language additions are technically significant, what it means practically for developers and businesses building voice AI products in Asian and South Asian markets, and how this fits into the broader competitive movement among ASR providers toward genuine multilingual capability.
What Nova-3 Is and Why It Matters
Understanding the significance of this expansion requires some context on what Nova-3 is and where it sits in the ASR provider ecosystem.
Deepgram's Nova Model Line
Deepgram has positioned itself as one of the leading ASR API providers for production applications, competing directly with Google Cloud Speech-to-Text, Microsoft Azure Speech, and Amazon Transcribe. Its Nova model line has been specifically optimized for real-world use cases: low latency, high accuracy on conversational audio (not just clean studio recordings), and competitive pricing at production call volumes. Nova-2, its predecessor, established strong performance benchmarks for English and a limited set of other languages. Nova-3 extends that foundation significantly.
Nova-3 differs from earlier Deepgram models in more than language coverage. It also improves how the model handles the specific acoustic and linguistic challenges of the newly supported languages. Those challenges differ substantially from the ones European-language ASR faces, and they cannot be addressed simply by adding more target-language training data to an existing model architecture.
The Market Context for This Expansion
The languages added in Nova-3's expansion represent extraordinary aggregate speaker populations. Mandarin Chinese has approximately 920 million native speakers, making it the world's most spoken native language. The Indic languages serve India's population of over 1.4 billion people, whose primary languages include Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, and Kannada. Thai is spoken by approximately 60 million people in Thailand and in diaspora communities. Cantonese, distinct from Mandarin in its phonology and vocabulary, is the primary spoken language for tens of millions of people in Hong Kong, Guangdong province, and overseas Cantonese communities.
Why Previous Models Underserved These Markets
Tonal languages like Mandarin, Cantonese, and Thai present ASR challenges that differ fundamentally from those of non-tonal languages. In Mandarin, the same syllable pronounced with each of the four tones can represent four entirely different words, so tone recognition is not a secondary feature but a core component of accurate word identification. Indic languages present a different but equally significant challenge: rich morphological complexity, a wide range of scripts, and substantial within-language dialectal variation. A Hindi ASR system trained primarily on standard Delhi Hindi may underperform significantly on the spoken Hindi of Bihar, Rajasthan, or Uttar Pradesh.
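To make the tone point concrete, here is a minimal sketch using the classic textbook example of the syllable "ma". The lexicon is a toy with one character per tone, and the function name `candidates` is hypothetical; a real pronunciation lexicon maps each toned syllable to many characters.

```python
# Toy lexicon: (pinyin syllable, tone number) -> one common character.
# Illustrative only; real lexicons list many characters per toned syllable.
TONED_LEXICON = {
    ("ma", 1): "妈",  # mā, "mother"
    ("ma", 2): "麻",  # má, "hemp"
    ("ma", 3): "马",  # mǎ, "horse"
    ("ma", 4): "骂",  # mà, "to scold"
}


def candidates(syllable, tone=None):
    """Return the characters compatible with a syllable, optionally filtered by tone."""
    return [
        char
        for (syl, t), char in TONED_LEXICON.items()
        if syl == syllable and (tone is None or t == tone)
    ]


print(candidates("ma"))     # tone unknown: all four words remain possible
print(candidates("ma", 3))  # tone recognized: one candidate in this toy lexicon
```

Without the tone, the recognizer has to choose among all four words from context alone. With it, the choice narrows before any language model is consulted.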
"Expanding ASR to tonal and morphologically complex languages is not a matter of collecting more training data. It requires architectural choices at the model level that account for how these languages encode meaning differently from the languages that dominated early ASR research."
The Technical Challenges of Tonal and Indic Language ASR
The specific languages added in Nova-3's expansion share some common ASR challenges, as well as language-family-specific ones that are worth understanding separately.
Mandarin: Tones, Homophones, and Character Mapping
Mandarin Chinese ASR faces three compounding challenges. First, accurate tone recognition is required for basic word identification accuracy, since without it a large proportion of syllables are ambiguous among several possible words. Second, Mandarin has an extremely high proportion of homophones relative to European languages, meaning even correct phonetic transcription still requires disambiguation using context. Third, the output of a Mandarin ASR system typically needs to be in Chinese characters (simplified or traditional) rather than phonetic romanization, adding a phoneme-to-character mapping step. Industry benchmarks have historically shown that even leading Mandarin ASR systems achieve character error rates 2 to 3 times higher than English word error rates on comparable clean speech, reflecting the genuine additional difficulty of the task.
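Because the comparison above sets character error rate (CER) against word error rate (WER), it helps to see how the two are computed. The sketch below uses the standard edit-distance definition and differs only in the unit being counted. The function names are illustrative, and the normalization (stripping whitespace for CER, splitting on whitespace for WER) is an assumption that should match whatever your own evaluation convention is.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of units."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference unit missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra unit in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    """Character error rate, ignoring whitespace (suits unsegmented scripts)."""
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    return error_rate(ref, hyp)


def wer(reference, hypothesis):
    """Word error rate on whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


print(cer("我想查询订单", "我想查寻订单"))  # one wrong character out of six
print(wer("check my order status", "check my older status"))  # one wrong word out of four
```

The two numbers are not directly comparable. A single wrong character in a six-character Mandarin phrase and a single wrong word in a four-word English phrase give different rates for what a listener would perceive as one mistake each.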
Cantonese: Distinct From Mandarin in Ways That Matter
A common assumption outside the Chinese-speaking world is that Cantonese and Mandarin ASR are similar problems. They are not. Cantonese has six to nine tones depending on classification, compared to Mandarin's four, making tonal accuracy even more critical. Cantonese also has a distinct vocabulary, distinctive code-switching patterns with English (particularly in Hong Kong speech), and a complex relationship between spoken Cantonese and written Chinese that means Cantonese ASR cannot simply reuse Mandarin ASR architecture. Building strong Cantonese ASR requires dedicated training data and model tuning, not adaptation of a Mandarin system.
Thai: Tone Language With No Word Boundaries
Thai is tonal, with five tones, and presents an additional challenge that has no parallel in European languages: Thai script does not use spaces between words. Word segmentation, determining where one word ends and the next begins, is itself a distinct NLP task that a Thai ASR pipeline must handle alongside acoustic recognition. This combination of tone recognition, word segmentation, and the complex consonant cluster phonology of Thai makes it one of the more technically demanding languages in the ASR field.
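The segmentation problem can be illustrated with a very small dictionary-based sketch. This is greedy forward maximal matching over a three-word toy lexicon, a classic baseline and not a description of how Nova-3 or any production system segments Thai; modern systems generally use statistical or neural segmenters. The function name `segment` and the lexicon are hypothetical.

```python
def segment(text, lexicon):
    """Greedy forward maximal matching: at each position take the longest lexicon entry."""
    max_len = max(len(word) for word in lexicon)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon:
                break
        else:
            # No lexicon entry starts here: emit one character so the loop advances.
            piece = text[i]
        words.append(piece)
        i += len(piece)
    return words


# Toy lexicon: "I", "eat", "rice".
WORDS = ["ฉัน", "กิน", "ข้าว"]
sentence = "".join(WORDS)  # written without spaces, as Thai normally is

print(sentence)
print(segment(sentence, set(WORDS)))
```

Greedy matching fails whenever a longer dictionary entry overlaps the true word boundary, which is one reason segmentation errors feed directly into Thai word error rates.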
Indic Languages: Diversity Within Diversity
The languages grouped here under the Indic label span more than one language family, each with distinct phonological systems, scripts, and morphological complexity. Hindi and Urdu share most of their spoken phonology but use entirely different scripts (Devanagari and the Perso-Arabic script in its Nastaliq style, respectively). Tamil and Telugu are Dravidian languages with phonological systems distinct from those of Indo-Aryan languages such as Hindi. Bengali has the second-largest native speaker population among these languages and its own script. Supporting Indic languages in ASR is not a single technical problem but a set of related, distinct problems, each requiring separate training data, script handling, and model tuning.
How Nova-3's Expansion Compares to the Competitive Field
Nova-3's multilingual push does not happen in isolation. It comes as every major ASR provider is investing aggressively in non-English language coverage, and understanding the competitive context clarifies where Nova-3's expansion is differentiated.
Google Cloud Speech-to-Text and Universal Speech Model
Google Cloud Speech-to-Text, backed by Google's Universal Speech Model trained on 12 million hours of audio across 300 languages, has had broad multilingual coverage including Mandarin, Thai, and several Indic languages for several years. Google's advantage is breadth and the sheer scale of training data. Its disadvantage, for API users building production applications, has historically been latency and pricing compared to specialized providers like Deepgram.
OpenAI Whisper and Its Multilingual Architecture
OpenAI's Whisper, trained on 680,000 hours of multilingual and multitask supervised audio collected from the web, achieved notably strong cross-language accuracy relative to its training approach, partly because its training data included a more representative distribution of non-English languages than curated commercial datasets. Whisper supports Mandarin and several Indic languages with competitive accuracy. OpenAI also offers Whisper through a hosted API, but teams that run the open-source model themselves need self-hosting infrastructure that not all production teams want to manage.
Microsoft Azure Speech
Microsoft Azure Cognitive Services Speech supports Mandarin (simplified and traditional), Cantonese, Hindi, and several other Indic languages in its custom and standard recognition offerings. Microsoft's integration with Azure enterprise infrastructure and its long-standing presence in Asian markets give it an advantage in enterprise segments, though its pricing model and API design differ from Deepgram's developer-focused approach.
| Provider | Mandarin | Cantonese | Thai | Indic Languages |
|---|---|---|---|---|
| Deepgram Nova-3 | Yes (new) | Yes (new) | Yes (new) | Expanding (new) |
| Google Cloud USM | Yes (established) | Yes (established) | Yes (established) | Broad (established) |
| OpenAI Whisper | Yes (strong) | Partial | Yes | Selective |
| Microsoft Azure Speech | Yes (established) | Yes (established) | Yes | Yes (several) |
Practical Impact: What This Means for Developers and Businesses
The technical achievement matters, but the practical question is what Nova-3's multilingual expansion actually enables for teams building voice AI products in these markets.
Contact Centers in Southeast Asia and India
The largest addressable market for the newly supported languages is customer-facing voice AI in contact centers across Southeast Asia and South Asia. Thailand's contact center industry handles over 500 million customer contacts per year, predominantly in Thai with code-switching to English in business contexts. India's contact center sector, one of the largest globally by volume, handles a mix of English-language international calls and a growing domestic market in Hindi and regional languages. For developers building IVR and voice agent systems for these markets, Nova-3's expansion provides a production-grade API option that previously did not exist at Deepgram's accuracy and latency tier.
Healthcare and Government Applications in India
India's government has made significant investments in voice-based public services, including telemedicine and citizen service lines that need to handle regional language diversity. The Indic language expansion in Nova-3 is directly relevant to these applications, where the ability to accurately transcribe a patient speaking in Tamil or Bengali rather than English or standardized Hindi is the difference between a functional service and one that excludes large segments of the population it is meant to serve.
Cantonese-Language Media and Education Tech
Cantonese has historically been among the most underserved languages in commercial ASR despite its substantial speaker population and strong economic presence in Hong Kong and southern China. Education technology, media captioning, and customer service applications for Cantonese-speaking markets have either relied on lower-accuracy Mandarin systems or expensive, slow human transcription. Nova-3's dedicated Cantonese capability addresses a genuine market gap that competitors have not fully closed.
Code-Switching: The Real-World Complexity Beyond Clean Language Support
In practice, speakers in the markets Nova-3's expansion targets rarely speak pure, single-language audio. Code-switching, the natural mixing of languages within a single conversation, is a pervasive reality that clean language benchmarks do not capture.
The Hong Kong Cantonese-English Mix
Hong Kong Cantonese is notable for extensive English code-switching at both the word and phrase level. A typical Hong Kong customer service call might switch between Cantonese and English multiple times within a single sentence, using English product names, technical terms, and even grammatical structures embedded in Cantonese speech. ASR systems that handle Cantonese and English independently, with the user required to declare a single language for the session, fail on this real-world speech pattern. Nova-3's Cantonese support is designed to handle this bilingual reality rather than treating it as an edge case.
Hinglish and Indian English Code-Switching
In India, code-switching between Hindi (or other Indic languages) and English is so common that Hinglish (Hindi-English mixed speech) is effectively its own recognized spoken register. Business conversations, customer service calls, and casual conversation all involve extensive mixing, with technical and business vocabulary often supplied in English regardless of the primary language of the conversation. Indic language ASR that does not handle this mixing will underperform significantly on the actual speech it encounters in production deployments.
Thai Formal vs. Colloquial Register Differences
Thai has significant differences between formal written-language registers and colloquial spoken Thai, particularly in vocabulary and particle usage. An ASR system trained predominantly on formal or broadcast Thai will underperform on the casual, conversational Thai that customer service and voice AI interactions typically involve, just as an English ASR system trained on news broadcasts would struggle with casual phone conversation. Real-world deployment performance on colloquial speech, not benchmark performance on formal speech, is the meaningful quality measure.
| Language | Key Code-Switching Pattern | ASR Implication |
|---|---|---|
| Cantonese (HK) | Frequent English word/phrase insertion | Bilingual model or robust switching detection required |
| Hindi / Indic | English business/technical vocabulary throughout | Must handle English loan words in Indic context |
| Thai | English product names, formal vs colloquial register | Colloquial training data and register flexibility |
| Mandarin (Mainland) | English technical terms in business contexts | Mixed-language handling for technical vocabulary |
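When assembling test data for the patterns in the table above, it can be useful to flag which reference transcripts mix scripts. The sketch below is a rough heuristic that labels each character by the first word of its Unicode name and flags transcripts where at least two scripts each exceed a threshold. The function names and the 10% threshold are assumptions, and the approach cannot see code-switching written in a single script, such as romanized Hinglish.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Rough script label from the Unicode name, e.g. LATIN, THAI, DEVANAGARI, CJK."""
    # Keep letters and combining marks (Thai and Devanagari vowel signs are marks).
    if not ch.isalpha() and not unicodedata.category(ch).startswith("M"):
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def script_shares(transcript):
    """Fraction of labelled characters belonging to each script."""
    counts = Counter(s for s in map(script_of, transcript) if s)
    total = sum(counts.values())
    return {script: n / total for script, n in counts.items()} if total else {}


def is_code_switched(transcript, threshold=0.1):
    """True when at least two scripts each account for `threshold` of the characters."""
    shares = script_shares(transcript)
    return sum(1 for share in shares.values() if share >= threshold) >= 2


for text in ["我想 check 一下 order status", "मुझे refund चाहिए", "check my order status"]:
    print(text, script_shares(text), is_code_switched(text))
```

Splitting a test set by this flag makes it possible to report error rates separately for mixed and single-language utterances, which is where providers tend to differ most.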
What This Means for the Broader Voice AI Ecosystem
Nova-3's language expansion is a single product update, but it reflects and accelerates broader trends in the voice AI ecosystem that affect everyone building multilingual voice applications.
ASR Is No Longer an English-First Technology
The competitive dynamic among ASR providers has shifted. Language coverage is now a first-class competitive dimension, not a secondary feature added after English capability is established. Every major provider, including Google, Microsoft, Amazon, and Deepgram, is investing in non-English languages not as a courtesy to global users but because the growth potential in Asian, South Asian, and Southeast Asian markets is substantially larger than in already-saturated English-language markets. India alone added over 200 million mobile internet users between 2020 and 2024, the vast majority of whom interact with digital services in Hindi or regional languages rather than English.
TTS and Voice Cloning Must Follow
ASR expansion without matching TTS expansion creates half a voice AI application. The ability to accurately transcribe Cantonese or Tamil serves the recognition side of a voice agent, but delivering responses in natural-sounding Cantonese or Tamil requires TTS and voice cloning systems with equivalent language coverage. This is where the TTS side of the ecosystem still has significant ground to close: high-quality neural TTS in Indic languages and Thai exists but remains thinner in variety and quality than English TTS. Platforms like VoxClone AI that focus on voice quality and cloning are part of the broader ecosystem responding to this need, and the demand for multilingual TTS of comparable quality to English will grow as ASR in these languages reaches production maturity. You can explore VoxClone AI's voice capabilities through the Google Play Store.
Benchmark Numbers vs. Production Reality
As with any ASR expansion, developers and businesses evaluating Nova-3's new language support should test against their own production data rather than relying solely on benchmark figures. Deepgram's published benchmarks for the new languages reflect controlled test set performance; actual production accuracy will depend on the specific dialect, recording conditions, topic domain, and code-switching patterns of the application's actual user base. Requesting or constructing a domain-specific test set before committing to a provider choice is as important for these new languages as it has always been for English.
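Running such a test means sending each clip in your own test set to the provider and collecting the transcript. The sketch below shows one way that might look against Deepgram's pre-recorded REST endpoint using only the Python standard library. The endpoint URL, the `model` and `language` query parameters, the `Token` authorization scheme, and the response path are stated here as assumptions to verify against Deepgram's current API reference. The language code `"th"` is a placeholder, and the exact codes Nova-3 accepts for each new language should be taken from the official documentation.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for pre-recorded audio; confirm against the current API reference.
DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"


def build_request(audio_bytes, api_key, language, model="nova-3",
                  content_type="audio/wav"):
    """Build (but do not send) a transcription request for one audio clip."""
    query = urllib.parse.urlencode({"model": model, "language": language})
    return urllib.request.Request(
        f"{DEEPGRAM_URL}?{query}",
        data=audio_bytes,
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": content_type,
        },
    )


def extract_transcript(response_body):
    """Pull the top transcript out of the JSON response (assumed response shape)."""
    payload = json.loads(response_body)
    return payload["results"]["channels"][0]["alternatives"][0]["transcript"]


def transcribe(audio_bytes, api_key, language):
    """Send one clip and return its transcript; performs a real network call."""
    request = build_request(audio_bytes, api_key, language)
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_transcript(response.read())


# Inspect the request without sending it ("th" is a placeholder language code).
req = build_request(b"\x00\x01", "YOUR_API_KEY", "th")
print(req.get_method(), req.full_url)
```

Keeping request construction separate from the network call makes it easy to swap in a second provider behind the same `transcribe` signature and score both on identical clips.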
Future Directions: Where Multilingual ASR Goes From Here
Nova-3's expansion is a point in a trajectory, not a destination, and understanding where multilingual ASR is headed over the next two to three years helps frame current capability investments.
Dialect-Aware Models Within Major Languages
Current multilingual ASR models typically support a language as a single monolithic model, handling the official or standard form well and performing unevenly on regional dialects. The next frontier, already being explored in research, is dialect-aware models that identify the speaker's dialect within a supported language and apply more targeted decoding accordingly. For Hindi specifically, where dialectal variation between regions is pronounced, dialect-aware ASR would represent a significant accuracy improvement for users outside the standard spoken Hindi distribution that current models are most heavily trained on.
Real-Time Multilingual Switching
The next important capability beyond accurate single-language support is real-time language identification and switching within a single audio stream, allowing a system to handle a conversation that moves between Thai and English, or between Hindi and Gujarati, without requiring the user to declare a language or switch sessions. This capability is already partially available in some research systems and is expected to reach production reliability during the Nova-3 generation for major pairs among its supported languages.
End-to-End Multilingual Voice Agents
As both ASR and TTS reach production quality across these newly supported languages, deploying fully multilingual voice agents becomes practical rather than aspirational. Such systems can handle a customer in Thai, Cantonese, Hindi, or English within the same deployment, without language-specific model selection. Handling any user in any supported language within a single agent system is the mature vision that current language-by-language model releases are building toward.
Practical Takeaways for Teams Evaluating Nova-3
If you are considering Nova-3 for a multilingual voice AI application in any of the newly supported languages, here is the practical guidance that will determine whether the evaluation leads to a productive deployment decision.
Evaluation Checklist
- Build a test set from your actual production audio, not generic benchmark content, since production accuracy on your specific domain and speaker population is the only number that matters for your application.
- Test code-switching patterns specifically if your application involves the language mixing described above for Cantonese, Hindi, or Thai, since this is where clean-audio benchmarks most significantly overestimate real-world performance.
- Evaluate latency at your expected concurrent call volume, not just single-call latency, since production throughput behavior under load is distinct from demo performance on a single stream.
- Compare against at least one established multilingual provider (Google Cloud STT, Azure Speech, or Whisper) on your own test set to establish a realistic baseline against which Nova-3's performance can be evaluated.
- Plan for TTS to match your ASR language expansion, since a voice AI application that can understand Thai but cannot respond naturally in Thai has solved only half the problem.
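The latency item in the checklist can be sketched as a small load test. The example below measures per-call latency at several concurrency levels with `asyncio`, using a stand-in `fake_transcribe` coroutine that simply sleeps. Replace it with your provider's real asynchronous client call. All names here are hypothetical, and the nearest-rank percentile is one of several reasonable definitions.

```python
import asyncio
import math
import time


async def fake_transcribe(clip):
    """Stand-in for a real ASR call; swap in your provider's async client."""
    await asyncio.sleep(0.01)
    return f"transcript for {clip}"


async def timed_call(transcribe, clip, limiter):
    """Run one call under the concurrency limit and return its latency in seconds."""
    async with limiter:
        start = time.perf_counter()
        await transcribe(clip)
        return time.perf_counter() - start


async def measure(transcribe, clips, concurrency):
    """Transcribe all clips with at most `concurrency` in flight; return sorted latencies."""
    limiter = asyncio.Semaphore(concurrency)
    latencies = await asyncio.gather(
        *(timed_call(transcribe, clip, limiter) for clip in clips)
    )
    return sorted(latencies)


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


if __name__ == "__main__":
    clips = [f"clip-{i}.wav" for i in range(50)]
    for concurrency in (1, 10, 50):
        lat = asyncio.run(measure(fake_transcribe, clips, concurrency))
        print(f"concurrency={concurrency} "
              f"p50={percentile(lat, 50):.4f}s p95={percentile(lat, 95):.4f}s")
```

With a real client, the interesting signal is how p95 moves as concurrency rises toward your expected peak call volume, since a flat median can hide a growing tail.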
What to Watch in the Coming Months
As Nova-3's new language support matures in production, independent benchmarks from developers testing on real-world audio will provide the most reliable signal about where the model genuinely excels and where it still has gaps. The initial release of any new language support typically covers the standard dialect and clean audio reasonably well; dialectal variation and noisy audio performance often improve with subsequent model updates as the provider receives production feedback and more diverse training data. Organizations making significant infrastructure commitments to Nova-3 for these languages should plan for ongoing evaluation rather than treating initial benchmarks as fixed.
Conclusion
Nova-3's multilingual expansion is a meaningful development in the commercial ASR market, bringing Deepgram's production-grade performance characteristics into language markets that have historically been underserved by comparable quality options. The combination of Thai, Cantonese, Mandarin, and Indic languages represents hundreds of millions of speakers and some of the fastest-growing markets for digital voice applications globally.
The technical challenges involved (tone recognition, code-switching, dialectal variation, and script handling) are not trivial, and the gap between headline language support and production-quality accuracy on real-world audio is real for every provider entering a new language. The right approach for any team evaluating these capabilities is rigorous testing on actual production audio before making deployment decisions, combined with ongoing monitoring as production data reveals real-world performance across the full diversity of speaker patterns the application encounters.
For the broader voice AI ecosystem, Nova-3's expansion reflects and reinforces a clear direction: multilingual capability is no longer optional for production-grade ASR providers, and the quality bar for non-English languages is rising toward the standard that English speech recognition has long delivered. For developers and businesses building voice AI products for Asian and South Asian markets, the improving supply of high-quality multilingual ASR is the infrastructure prerequisite for the next generation of locally relevant, genuinely accessible voice applications.