Microsoft has released three new in-house AI models for text, speech, and image generation
Microsoft AI launches three new multimodal models
In an effort to strengthen its position in artificial intelligence (AI), Microsoft AI’s research division announced the release of three proprietary models capable of generating text, audio, and images. This move was a response to competition from leading AI labs.
| Model | Purpose | Key metrics |
|---|---|---|
| MAI‑Transcribe‑1 | Converts speech to text | 25 languages, 2.5× faster than Azure Fast |
| MAI‑Voice‑1 | Synthesizes speech audio | One minute of audio generated in one second; voice tuning |
| MAI‑Image‑2 | Generates images from text | |
The project was developed by the MAI Superintelligence team, a division focused on fundamental research into advanced AI systems. The team was set up in November 2025 and is led by Mustafa Suleyman, chief executive of Microsoft AI.
Cost efficiency
The developers placed special emphasis on keeping prices below those of comparable Google and OpenAI models:
| Service | Price |
|---|---|
| Speech transcription | $0.36 per hour of audio |
| Speech synthesis | $22 per 1 million characters |
| Image generation | $5 per 1 million input tokens; $33 per 1 million output tokens |
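To give a sense of what these rates mean in practice, here is a minimal Python sketch that estimates a bill from the listed prices. The function name `estimate_cost` and the sample workload are hypothetical, and the sketch assumes simple linear billing with no minimum charges, volume tiers, or rounding rules, none of which the announcement describes.

```python
# Prices in USD, taken from the table above.
TRANSCRIPTION_PER_HOUR = 0.36
SPEECH_PER_MILLION_CHARS = 22.0
IMAGE_INPUT_PER_MILLION_TOKENS = 5.0
IMAGE_OUTPUT_PER_MILLION_TOKENS = 33.0


def estimate_cost(audio_hours=0.0, tts_chars=0,
                  image_input_tokens=0, image_output_tokens=0):
    """Estimate a bill in USD, assuming purely linear pricing (an assumption)."""
    cost = (
        audio_hours * TRANSCRIPTION_PER_HOUR
        + tts_chars / 1_000_000 * SPEECH_PER_MILLION_CHARS
        + image_input_tokens / 1_000_000 * IMAGE_INPUT_PER_MILLION_TOKENS
        + image_output_tokens / 1_000_000 * IMAGE_OUTPUT_PER_MILLION_TOKENS
    )
    return round(cost, 2)


# Hypothetical monthly workload: 100 hours of transcription,
# 500,000 synthesized characters, and image generation using
# 200,000 input tokens and 1,000,000 output tokens.
monthly = estimate_cost(
    audio_hours=100,
    tts_chars=500_000,
    image_input_tokens=200_000,
    image_output_tokens=1_000_000,
)
print(f"Estimated monthly cost: ${monthly:.2f}")
```

For this sample workload, image output tokens account for the largest share of the total, so output volume is the figure to watch when budgeting for image generation.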
The models are already deployed on the Microsoft Foundry platform. Transcription and speech synthesis are available in MAI Playground.
Partnership with OpenAI
Although Microsoft is actively developing its own models, Mustafa Suleyman confirmed that the company remains committed to working with OpenAI, in which it has already invested more than $13 billion. Microsoft will continue to use OpenAI models in its products under a long-term contract, following a diversification strategy similar to the one it applies to chip suppliers.
Thus, Microsoft AI is strengthening its market position by offering fast and cost‑effective multimodal solutions while maintaining close ties with key partners.