Suno AI, based in Cambridge, has created an audio AI named Bark that generates natural-sounding speech from text prompts. Beyond plain vocals, Bark can interpret stage directions such as “[laughs]”, “[gasps]”, or “[sighs]”. Bark currently understands 13 languages, which can be mixed within a single prompt; when a prompt switches languages, the model can even carry an accent over, a phenomenon known as code switching. English output is currently the most reliable, however.
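For illustration, here is a minimal sketch of how such a prompt looks through Bark's Python API, assuming the `bark` package from the project's GitHub repository and `scipy` are installed; the output file name is arbitrary:

```python
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# Download and cache the model weights (several GB on first use).
preload_models()

# Square-bracket stage directions are rendered as non-speech sounds.
prompt = "Hello, my name is Suno. And, uh, I like pizza. [laughs]"
audio_array = generate_audio(prompt)

# generate_audio returns a NumPy array at Bark's native sample rate (24 kHz).
write_wav("bark_speech.wav", SAMPLE_RATE, audio_array)
```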
Bark makes no distinction between speech and music and sometimes interprets text prompts as lyrics, performing them melodically. Under the hood, the audio AI uses GPT-style Transformer models with over 100 million parameters, built with nanoGPT and following an approach similar to AudioLM. Generating speech in near-real-time requires GPU acceleration and an up-to-date PyTorch nightly build; without a GPU, processing times can increase by up to a hundredfold.
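The same call covers singing: the repository documents ♪ marks as a hint that a passage should be sung rather than spoken. The sketch below assumes the same setup as above; the SUNO_USE_SMALL_MODELS environment variable is, at the time of writing, the repository's switch for loading smaller models on machines without a large GPU:

```python
import os

# Optional: fall back to the smaller model variants on modest hardware.
# Must be set before bark is imported.
os.environ["SUNO_USE_SMALL_MODELS"] = "True"

from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()

# ♪ marks nudge the model toward singing the enclosed text.
song = "♪ In the jungle, the mighty jungle, the lion barks tonight ♪"
write_wav("bark_song.wav", SAMPLE_RATE, generate_audio(song))
```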
To deter misuse, Bark ships with only a limited set of synthetic voices per language. Nevertheless, the model can be conditioned on users' own audio recordings, picking up their intonation, pitch, emotion, and prosody. If you want to try the demo version of Bark, the project is available on GitHub under a Creative Commons Attribution-NonCommercial 4.0 International Public License. If you wish to test Suno AI’s language models in depth, you can register for the Suno Studio waiting list.
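The voice limitation is visible in the API itself: generate_audio accepts a history_prompt argument that selects one of the bundled speaker presets. In the sketch below, “v2/en_speaker_6” is one of the English presets shipped with the repository, and pairing it with non-English text also tends to produce the accented code switching described above:

```python
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()

# German text spoken with one of the bundled English presets,
# which typically yields an English-accented delivery.
audio_array = generate_audio(
    "Guten Tag, wie geht es Ihnen?",
    history_prompt="v2/en_speaker_6",
)
write_wav("bark_accent.wav", SAMPLE_RATE, audio_array)
```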