Kyutai, a French AI research lab, has unveiled a new AI-powered voice chatbot named “Moshi,” whose features rival the delayed Advanced Voice Mode of OpenAI’s GPT-4o in ChatGPT. Moshi stands out for its ability to understand and interpret the tone of a user’s voice, its fast responses, and its capacity to operate offline.
Key Features of Moshi
- Advanced Voice Understanding: Moshi can comprehend varied tones and nuances in human conversation, supported by its ability to speak in different accents and in 70 emotional and speaking styles.
- Simultaneous Audio Streams: The chatbot handles two audio streams at once, so it can listen and speak at the same time (see the sketch after this list).
- Rapid Response Time: With a response time of just 200 milliseconds, Moshi is quicker than GPT-4o’s Advanced Voice Mode, which typically takes between 232 and 320 milliseconds.
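Kyutai has not published implementation details, but the idea behind full-duplex operation can be illustrated with a minimal asyncio sketch in which an incoming (listening) stream and an outgoing (speaking) stream are processed concurrently. All names, frame sizes, and queue mechanics below are illustrative assumptions, not Kyutai’s code.

```python
import asyncio

FRAME_MS = 80  # assumed frame duration; Kyutai has not published this figure

async def listen(mic: asyncio.Queue) -> None:
    """Keep consuming the user's audio frames, even while a reply plays."""
    while (frame := await mic.get()) is not None:
        print(f"heard:    {frame}")

async def speak() -> None:
    """Emit the model's reply frames on the second, outgoing stream."""
    for i in range(3):
        print(f"speaking: reply-frame-{i}")
        await asyncio.sleep(FRAME_MS / 1000)

async def main() -> None:
    mic: asyncio.Queue = asyncio.Queue()
    for i in range(3):                     # fake microphone input
        mic.put_nowait(f"user-frame-{i}")
    mic.put_nowait(None)                   # end-of-stream sentinel
    # Both directions run concurrently: listening never blocks speaking.
    await asyncio.gather(listen(mic), speak())

asyncio.run(main())
```

The point of the design is that neither direction ever waits for the other, which is what lets a system respond (or be interrupted) mid-sentence.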
Development and Technology
Moshi is built on Helium, a 7-billion-parameter large language model (LLM). Although the model is far smaller than GPT-4o and was developed in just six months by a team of eight researchers, its training was tightly focused: Moshi was trained on 100,000 synthetic dialogues generated with text-to-speech (TTS) technology. This approach allowed the team to teach Moshi to replicate not just sentences but also tones and voices.
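Kyutai describes the training data only at this high level, so the sketch below merely shows the shape such a synthetic-dialogue pipeline could take. `sample_dialogue` and `synthesize` are hypothetical placeholders standing in for a text LLM and a TTS engine; none of these names come from Kyutai.

```python
import json
from pathlib import Path

def sample_dialogue(seed: int) -> list[dict]:
    """Placeholder: a text LLM would generate a multi-turn script here."""
    return [
        {"speaker": "user", "text": f"Question number {seed}?"},
        {"speaker": "moshi", "text": "An answer, in a chosen tone."},
    ]

def synthesize(text: str, voice: str) -> bytes:
    """Placeholder: a TTS engine would return waveform audio here."""
    return text.encode()  # dummy bytes so the sketch runs end to end

def build_corpus(n_dialogues: int, out_dir: Path) -> None:
    """Render each scripted dialogue to audio and keep the text alongside."""
    out_dir.mkdir(exist_ok=True)
    for i in range(n_dialogues):
        turns = sample_dialogue(i)
        audio = b"".join(synthesize(t["text"], voice=t["speaker"]) for t in turns)
        (out_dir / f"dialogue_{i}.wav").write_bytes(audio)
        (out_dir / f"dialogue_{i}.json").write_text(json.dumps(turns))

build_corpus(3, Path("synthetic_corpus"))  # ~100,000 in the reported effort
```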
Collaborative Efforts
To enhance voice quality, Kyutai collaborated with a professional voice artist, ensuring that Moshi delivers natural-sounding audio responses. The chatbot, named after “moshi moshi,” the phrase used in Japanese to answer a phone call, aims to offer a natural and engaging user experience.
Open Source and Privacy
Kyutai plans to release Moshi as an open-source project, making the model’s code and framework publicly available. The move underscores the lab’s commitment to privacy and user control: a model that can run locally and offline lets users interact with it without routing their conversations through external servers.
Future Developments
Kyutai is also developing an AI-powered system for audio identification, watermarking, and signature tracking, which it intends to integrate with Moshi. Although Moshi is currently a research prototype, its speed and its ability to replicate tones and voices mark a significant step forward for offline-capable, open-source AI models.
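The announcement does not explain how the watermarking system works. As a point of reference only, the sketch below implements a classic spread-spectrum audio watermark, one well-known way to make generated audio identifiable: a key-seeded pseudorandom signal is mixed in at low amplitude and later detected by correlation. Every name and parameter here is an illustrative assumption, not Kyutai’s method.

```python
import numpy as np

def embed(audio: np.ndarray, key: int, strength: float = 0.05) -> np.ndarray:
    """Mix a key-seeded pseudorandom signal into the audio.

    The strength is exaggerated here so the toy detection is obvious.
    """
    rng = np.random.default_rng(key)
    return audio + strength * rng.standard_normal(audio.shape)

def detect(audio: np.ndarray, key: int) -> float:
    """Normalized correlation with the keyed signal: near zero for
    unmarked audio, clearly positive when the watermark is present."""
    rng = np.random.default_rng(key)
    mark = rng.standard_normal(audio.shape)
    return float(np.dot(audio, mark) /
                 (np.linalg.norm(audio) * np.linalg.norm(mark)))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16_000)   # one second of fake 16 kHz audio
marked = embed(clean, key=42)
print(detect(clean, key=42), detect(marked, key=42))
```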