Transformer-based speech and music generation workflow
This project compares three generative audio tasks in one workflow: standard text-to-speech, expressive speech synthesis, and text-conditioned music generation.
Challenge
- Generative audio models map text to different kinds of audio output: plain speech, expressive speech, and music.
- Speech clarity, expressive delivery, and music generation each call for different prompts and rest on different model assumptions.
- Running multiple audio models requires careful setup and memory-aware execution.
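One way to keep execution memory-aware is to hold only one model in memory at a time. The sketch below is a minimal illustration under that assumption, not the project's actual code: run_sequentially, load_fn, and run_fn are hypothetical names, and the CUDA cache is cleared only if PyTorch is installed and a GPU is present.

```python
import gc


def _release_memory():
    """Collect garbage and, if PyTorch with CUDA is present, free cached GPU blocks."""
    gc.collect()
    try:
        import torch  # optional dependency; skipped when not installed
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_sequentially(jobs):
    """Run (name, load_fn, run_fn) jobs one after another.

    Each model is loaded, used, and dropped before the next one is loaded,
    so peak memory is bounded by the largest single model rather than the sum.
    """
    results = {}
    for name, load_fn, run_fn in jobs:
        model = load_fn()
        try:
            results[name] = run_fn(model)
        finally:
            # Drop the only reference held here, then reclaim memory.
            del model
            _release_memory()
    return results
```

In practice load_fn would wrap a call such as from_pretrained for one model and run_fn would generate audio from a prompt. The results should hold only plain arrays, not references back to the model, otherwise the model cannot be freed.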
System architecture
Data and inputs
The workflow uses controlled text prompts for spoken narration, expressive speech cues, and music descriptions.
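A controlled prompt set of this kind could be organized as follows. This is a sketch only: the prompt texts, the Prompt class, and the task labels are illustrative assumptions, not the prompts used in the project. The bracketed cues in the expressive example follow the non-speech tokens that Bark's documentation lists, such as [laughs] and [sighs].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prompt:
    task: str  # "tts", "expressive", or "music" (hypothetical labels)
    text: str


# Illustrative prompts, one per task; the project's real prompts are not shown here.
PROMPTS = [
    Prompt("tts", "The museum opens at nine and closes at five."),
    Prompt("expressive", "Well... [sighs] I did not expect that. [laughs]"),
    Prompt("music", "slow acoustic guitar with soft piano, calm and warm mood"),
]


def prompts_for(task):
    """Return the prompt texts registered for one task."""
    return [p.text for p in PROMPTS if p.task == task]
```

Keeping the prompts in one table makes it easier to vary a single cue at a time and compare the outputs.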
Technical approach
- Use SpeechT5 with a vocoder for clear text-to-speech.
- Use Bark to test expressive speech controlled by prompt wording.
- Use MusicGen to generate short music from instrumentation and mood descriptions.
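The three steps above can be sketched with the Hugging Face transformers library. The checkpoint names are assumptions (small public checkpoints, not necessarily the ones used in the project), and the function names are hypothetical. SpeechT5 also needs a speaker embedding, which the caller must supply here because the source does not say which one was used. The imports are placed inside the functions so that each model's classes are loaded only when that function is called.

```python
# Assumed checkpoints: small public Hugging Face models, chosen for illustration.
CHECKPOINTS = {
    "tts": ("microsoft/speecht5_tts", "microsoft/speecht5_hifigan"),
    "expressive": ("suno/bark-small",),
    "music": ("facebook/musicgen-small",),
}


def speecht5_tts(text, speaker_embeddings):
    """Clear narration: SpeechT5 predicts a spectrogram, the HiFi-GAN vocoder renders audio."""
    from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

    model_id, vocoder_id = CHECKPOINTS["tts"]
    processor = SpeechT5Processor.from_pretrained(model_id)
    model = SpeechT5ForTextToSpeech.from_pretrained(model_id)
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_id)
    inputs = processor(text=text, return_tensors="pt")
    # speaker_embeddings: torch tensor of shape (1, 512) holding an x-vector.
    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
    return speech.detach().cpu().numpy(), 16_000  # SpeechT5 audio is 16 kHz


def bark_speech(text):
    """Expressive speech: delivery is steered by the wording and cues in the prompt."""
    from transformers import AutoProcessor, BarkModel

    (model_id,) = CHECKPOINTS["expressive"]
    processor = AutoProcessor.from_pretrained(model_id)
    model = BarkModel.from_pretrained(model_id)
    inputs = processor(text, return_tensors="pt")
    audio = model.generate(**inputs)
    return audio.cpu().numpy().squeeze(), model.generation_config.sample_rate


def musicgen_clip(description, max_new_tokens=256):
    """Short music clip from an instrumentation and mood description."""
    from transformers import AutoProcessor, MusicgenForConditionalGeneration

    (model_id,) = CHECKPOINTS["music"]
    processor = AutoProcessor.from_pretrained(model_id)
    model = MusicgenForConditionalGeneration.from_pretrained(model_id)
    inputs = processor(text=[description], padding=True, return_tensors="pt")
    # 256 new tokens corresponds to roughly five seconds of audio.
    audio = model.generate(**inputs, do_sample=True, max_new_tokens=max_new_tokens)
    return audio[0, 0].cpu().numpy(), model.config.audio_encoder.sampling_rate
```

Each function returns a NumPy waveform together with its sample rate, which could then be written to a WAV file, for example with scipy.io.wavfile.write.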
Evaluation and results
Tasks: 3 generative audio tasks
Models: SpeechT5, Bark, MusicGen
Outputs: speech waveforms and spectrograms
- SpeechT5 worked best for controlled narration.
- Bark responded to emotional and style cues in the prompt.
- MusicGen demonstrated text-to-audio generation beyond spoken language.
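Waveform and spectrogram outputs like those listed above can be derived from any generated clip with a short-time Fourier transform. The sketch below uses NumPy only and a synthetic 440 Hz tone in place of model output. The function names, frame size, hop length, and 16 kHz sample rate are illustrative assumptions rather than the project's settings.

```python
import numpy as np


def magnitude_spectrogram(waveform, n_fft=512, hop_length=128):
    """Return an (n_frames, n_fft // 2 + 1) magnitude spectrogram of a mono waveform."""
    waveform = np.asarray(waveform, dtype=np.float64)
    if waveform.ndim != 1:
        raise ValueError("expected a mono 1-D waveform")
    if len(waveform) < n_fft:
        # Zero-pad very short clips so that at least one frame exists.
        waveform = np.pad(waveform, (0, n_fft - len(waveform)))
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop_length
    frames = np.stack([
        waveform[i * hop_length: i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))


def dominant_frequency(spec, sample_rate, n_fft=512):
    """Frequency in Hz of the strongest bin in the time-averaged spectrum."""
    mean_spectrum = spec.mean(axis=0)
    return float(np.argmax(mean_spectrum) * sample_rate / n_fft)


# Synthetic stand-in for generated audio: one second of a 440 Hz tone.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)
spec = magnitude_spectrogram(tone)
```

For plotting, the magnitudes are usually converted to decibels, for example with 20 * np.log10(spec + 1e-10), before being shown as an image with time on one axis and frequency on the other.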
Implementation and code
Implementation focus
The implementation connects prompt preparation, generation with each model, and inspection of the resulting waveforms and spectrograms in one structured workflow, so that the technical decisions at each step stay visible.
Source code
The code is available for exploring the implementation details and for extending the experiments.
Scope and responsible use
The project is a focused modeling and evaluation study. Broader use should be supported by validation on additional data, robustness checks, monitoring, and domain-specific evaluation.
Future development
- Add listening notes and qualitative comparison tables.
- Track latency and memory use across models.
- Add longer prompt experiments for music structure.
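As a starting point for the latency and memory item, a small wrapper could record wall-clock time and peak Python-level memory for each generation call. This is a sketch under stated assumptions: measure is a hypothetical helper, and tracemalloc only sees allocations made through Python's allocator, so it will miss most tensor memory. On a GPU, torch.cuda.reset_peak_memory_stats and torch.cuda.max_memory_allocated would be the more relevant counters.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Call fn and return (result, stats) with elapsed seconds and peak traced bytes."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        # get_traced_memory returns (current, peak) in bytes since start().
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, {"seconds": elapsed, "peak_python_bytes": peak}
```

Wrapping each model's generation call in measure would give one comparable row per model for a latency and memory table.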
Technical contribution
The project compares how prompt design controls speech, expression, and music across transformer-based audio models.