OpenAI rolled out three new voice models on its API platform, positioning the company at the forefront of real‑time conversational AI. The flagship, GPT‑Realtime‑2, delivers what OpenAI describes as GPT‑5‑class reasoning inside a single audio loop, eliminating the need for separate transcription, synthesis and logic components. A separate model, GPT‑Realtime‑Translate, handles live translation across more than 70 input languages and 13 output languages. The third offering, GPT‑Realtime‑Whisper, provides streaming speech‑to‑text with low latency.
Developers building voice agents have traditionally stitched together a patchwork of services: Whisper or Deepgram for transcription, ElevenLabs or Cartesia for text‑to‑speech, and large language models such as GPT‑4 for reasoning. OpenAI’s integrated approach collapses that stack into one model that both listens and speaks while running complex reasoning in between. The result is smoother turn‑taking, fewer silent pauses and the ability to fire multiple tool calls in parallel, a capability previously simulated with prompt scaffolding.
GPT‑Realtime‑2 introduces several production‑ready features. Preambles let the assistant say “let me check that” while it contacts back‑end services, keeping the user engaged. Parallel tool calls let the model request several resources at once and narrate which one is active. Recovery behavior surfaces errors instead of freezing the conversation. The model also offers tone‑adjustment knobs, allowing a calmer voice for support scenarios or a more upbeat tone for confirmations.
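The preamble, parallel tool call and error-recovery behaviors described above can be illustrated with a generic concurrency sketch. This is not OpenAI's API: the functions `speak`, `lookup_order`, `lookup_refund_policy` and `handle_turn` are hypothetical stand-ins, and the tools are simulated with short sleeps. It only shows the pattern of starting several lookups at once, saying a filler phrase while they run, and reporting failures instead of stalling.

```python
import asyncio


async def speak(text: str, transcript: list[str]) -> None:
    """Hypothetical stand-in for audio output: record what would be said."""
    transcript.append(text)
    await asyncio.sleep(0)  # yield to the event loop so pending tools can run


async def lookup_order(order_id: str) -> dict:
    """Hypothetical back-end tool that succeeds after a short delay."""
    await asyncio.sleep(0.01)
    return {"order_id": order_id, "status": "shipped"}


async def lookup_refund_policy() -> dict:
    """Hypothetical back-end tool that fails, to exercise error recovery."""
    await asyncio.sleep(0.01)
    raise TimeoutError("policy service timed out")


async def handle_turn(order_id: str) -> list[str]:
    transcript: list[str] = []

    # Schedule both tools first so the preamble does not delay them.
    tasks = {
        "order lookup": asyncio.create_task(lookup_order(order_id)),
        "refund policy lookup": asyncio.create_task(lookup_refund_policy()),
    }

    # Preamble: keep the user engaged while the tools are in flight.
    await speak("Let me check that.", transcript)

    # Collect results; exceptions are returned rather than raised.
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)

    for name, result in zip(tasks.keys(), results):
        if isinstance(result, Exception):
            # Recovery: surface the error instead of going silent.
            await speak(f"I could not complete the {name}: {result}", transcript)
        else:
            await speak(f"The {name} returned: {result}", transcript)

    return transcript


if __name__ == "__main__":
    for line in asyncio.run(handle_turn("A123")):
        print(line)
```

In a real voice agent the model itself decides when to speak a preamble and which tools to call; the sketch only mirrors that control flow on the application side.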
Under the hood, the context window expands to 128,000 tokens, four times the previous 32K limit. This jump makes longer sessions and intricate agentic flows feasible without external state stitching. Reasoning effort is exposed as a selectable knob with five levels (minimal, low, medium, high and xhigh), with low as the default to preserve latency. In OpenAI’s internal benchmarks, GPT‑Realtime‑2 at high effort outperformed its predecessor by 15.2% on the Big Bench Audio test and 13.8% on the Audio MultiChallenge instruction‑following benchmark.
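As a rough illustration of the effort knob, the sketch below models the five levels and the default named above as a small Python enum. The article does not show the actual request format, so the model identifier `gpt-realtime-2`, the dictionary keys and the helper `build_session_settings` are assumptions made for illustration only, not the documented API.

```python
from enum import Enum
from typing import Optional


class ReasoningEffort(str, Enum):
    """The five effort levels named in the article."""
    MINIMAL = "minimal"
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    XHIGH = "xhigh"


# Figures taken from the article.
DEFAULT_EFFORT = ReasoningEffort.LOW
CONTEXT_WINDOW_TOKENS = 128_000
PREVIOUS_CONTEXT_WINDOW_TOKENS = 32_000


def build_session_settings(effort: Optional[str] = None) -> dict:
    """Return a hypothetical settings dict; key names are assumptions."""
    # ReasoningEffort(...) raises ValueError for an unknown level.
    chosen = ReasoningEffort(effort) if effort is not None else DEFAULT_EFFORT
    return {
        "model": "gpt-realtime-2",          # assumed identifier
        "reasoning_effort": chosen.value,   # assumed key name
    }


if __name__ == "__main__":
    print(build_session_settings())
    print(build_session_settings("high"))
    print(CONTEXT_WINDOW_TOKENS // PREVIOUS_CONTEXT_WINDOW_TOKENS)
```

Validating the level up front means a typo such as "hi" fails fast in application code rather than surfacing as a rejected request.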
Early customers already see measurable benefits. Zillow recorded a 26‑point lift in call‑success rate on its toughest adversarial benchmark, climbing from 69% to 95% after switching to GPT‑Realtime‑2. BolnaAI, which builds voice‑AI solutions for Indian languages, reported a 12.5% reduction in word‑error rates for Hindi, Tamil and Telugu when using the translation model.
Pricing signals OpenAI’s intent to disrupt the market. GPT‑Realtime‑2 costs $32 per million audio‑input tokens and $64 per million audio‑output tokens, with cached input tokens billed at $0.40 per million. GPT‑Realtime‑Translate is priced at $0.034 per minute, or about 3.4 cents, undercutting most enterprise translation pipelines. GPT‑Realtime‑Whisper runs at $0.017 per minute, half the cost of comparable streaming transcription services.
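To make these rates concrete, here is a small cost estimator built only from the prices quoted above. It assumes the $0.40 cached-input figure is quoted per million tokens, like the other token prices. The function names and the token counts in the usage example are illustrative, since the article does not say how many tokens a minute of audio consumes.

```python
# Prices quoted in the article, in US dollars.
AUDIO_INPUT_PER_MILLION = 32.00
AUDIO_OUTPUT_PER_MILLION = 64.00
CACHED_INPUT_PER_MILLION = 0.40   # assumed to be per million tokens
TRANSLATE_PER_MINUTE = 0.034
WHISPER_PER_MINUTE = 0.017


def realtime_cost(input_tokens: int, output_tokens: int,
                  cached_input_tokens: int = 0) -> float:
    """Estimated GPT-Realtime-2 cost in dollars for the given token counts."""
    total = (
        input_tokens * AUDIO_INPUT_PER_MILLION
        + output_tokens * AUDIO_OUTPUT_PER_MILLION
        + cached_input_tokens * CACHED_INPUT_PER_MILLION
    )
    return total / 1_000_000


def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Estimated cost in dollars for a per-minute priced model."""
    return minutes * rate_per_minute


if __name__ == "__main__":
    # Illustrative token counts, not measured values.
    print(f"Realtime session: ${realtime_cost(100_000, 50_000, 200_000):.2f}")
    print(f"60 min translate: ${per_minute_cost(60, TRANSLATE_PER_MINUTE):.2f}")
    print(f"60 min transcription: ${per_minute_cost(60, WHISPER_PER_MINUTE):.2f}")
```

At these rates an hour of live translation comes to about $2.04 and an hour of streaming transcription to about $1.02.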
The aggressive pricing puts pressure on vendors like ElevenLabs, Deepgram and other voice‑infrastructure providers that traditionally bundle synthesis and inference on a per‑minute basis. While OpenAI’s models remove some integration work, developers still need to implement guardrails, compliance checks, brand‑voice tuning and analytics before deployment. OpenAI ships active classifiers and EU data residency options, but the onus of building a complete, production‑ready voice agent remains with the developer.
Industry observers will watch how quickly competing platforms can match OpenAI’s integrated stack. ElevenLabs recently closed a Series D round at an $11 billion valuation, betting on the “agent thesis,” while Deepgram continues to push its own streaming‑transcription offerings. The next quarter will likely be the first real‑world comparison of production workloads rather than demos.
For developers eager to experiment, OpenAI provides a Playground tab and an SDK call that let users test the new models immediately. The combination of higher‑quality reasoning, broader language coverage and aggressive pricing suggests OpenAI is not waiting for the market to catch up.
This article was written with the assistance of AI.