The recommended duration for uploaded audio is between 5 seconds and 2 minutes.
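The duration guideline can be checked locally before upload. The following is a minimal sketch, assuming the audio is an uncompressed WAV file readable by Python's standard `wave` module; the helper names (`wav_duration_seconds`, `is_recommended_duration`, `make_silent_wav`) are illustrative and not part of the service.

```python
import io
import wave

# Recommended bounds taken from the guideline above (5 seconds to 2 minutes).
MIN_SECONDS = 5.0
MAX_SECONDS = 120.0


def wav_duration_seconds(source) -> float:
    """Return the duration of a WAV file (path or binary file-like object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_recommended_duration(seconds: float) -> bool:
    """True if the duration falls within the recommended range."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS


def make_silent_wav(seconds: int, rate: int = 8000) -> io.BytesIO:
    """Build an in-memory mono 16-bit silent WAV, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    clip = make_silent_wav(6)
    duration = wav_duration_seconds(clip)
    print(duration, is_recommended_duration(duration))
```

Compressed formats such as MP3 would need a different decoder; this sketch covers WAV only.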
Use clear, noise-free human speech recordings for both reference audio and style audio.
If style audio and its corresponding transcript are provided during cloning, the system can extract style and emotion features from the style audio and combine them with the timbre of the reference audio. The style audio and reference audio may come from different speakers.
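Because the style audio is only useful together with its transcript, a client can enforce that pairing before sending a cloning request. The following is a minimal sketch of such a check. The function name `build_clone_request` and every field name in the returned dictionary are hypothetical and chosen for illustration; the actual request format is defined by the service's API reference.

```python
from typing import Optional


def build_clone_request(
    reference_audio: str,
    style_audio: Optional[str] = None,
    style_transcript: Optional[str] = None,
) -> dict:
    """Assemble a cloning request body (hypothetical field names).

    The reference audio supplies the timbre. Style audio and its transcript
    are optional, but must be given together or not at all.
    """
    if (style_audio is None) != (style_transcript is None):
        raise ValueError("style audio and its transcript must be provided together")

    request = {"reference_audio": reference_audio}  # hypothetical field name
    if style_audio is not None:
        # The style audio may come from a different speaker than the reference.
        request["style_audio"] = style_audio  # hypothetical field name
        request["style_transcript"] = style_transcript  # hypothetical field name
    return request


if __name__ == "__main__":
    print(build_clone_request("ref.wav"))
    print(build_clone_request("ref.wav", "style.wav", "Hello there."))
```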
Speech synthesis with the cloned voice can use the u2-tts-clone model; select the voice by passing its voice_id.