Speech & Music Generation with Transformers

Generative Media | May 2, 2026 | Published project

Transformer-based speech and music generation workflow

This project compares three generative audio tasks in one workflow: standard text-to-speech, expressive speech synthesis, and text-conditioned music generation.

Python, Transformers, SpeechT5, Bark, MusicGen, HiFi-GAN

Challenge

Each generative audio model maps text to a different kind of output: narrated speech, expressive speech, or music.
Speech clarity, expressive delivery, and music generation need different prompts and model assumptions.
Running multiple audio models requires careful setup and memory-aware execution.
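
One way to keep execution memory-aware is to load, run, and discard one model at a time, so that only a single model is resident at any moment. The sketch below is an assumption about how that could look, not the project's actual code: the `Task` tuple, `run_sequentially`, and `free_accelerator_cache` are hypothetical names, and the loaders are placeholders for the real SpeechT5, Bark, and MusicGen loading code.

```python
import gc
from typing import Callable, Dict, List, Tuple

# Each task is (name, load_fn, generate_fn). All three are hypothetical
# placeholders; in practice they would wrap SpeechT5, Bark, and MusicGen.
Task = Tuple[str, Callable[[], object], Callable[[object], object]]


def free_accelerator_cache() -> None:
    """Release cached GPU memory if PyTorch with CUDA happens to be available."""
    try:
        import torch
    except ImportError:
        return  # CPU-only environment without PyTorch: nothing to release
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_sequentially(tasks: List[Task]) -> Dict[str, object]:
    """Load, run, and discard one model at a time."""
    outputs: Dict[str, object] = {}
    for name, load_fn, generate_fn in tasks:
        model = load_fn()
        outputs[name] = generate_fn(model)
        del model      # drop the only reference to the model
        gc.collect()   # collect reference cycles promptly
        free_accelerator_cache()
    return outputs
```

Each `load_fn` would return a ready model and each `generate_fn` would return the generated audio, so only the outputs outlive the loop.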

System architecture

Text prompt -> SpeechT5 / Bark / MusicGen

Data and inputs

The workflow uses controlled text prompts for spoken narration, expressive speech cues, and music descriptions.
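
A small prompt table is one way to keep the three input types controlled and comparable. The example below is illustrative only: the `PromptCase` structure and every prompt string are assumptions, not the prompts used in the project. The bracketed cues in the Bark prompt follow the non-speech cue style that Bark documents.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PromptCase:
    task: str   # "tts", "expressive", or "music"
    model: str  # model family named in this write-up
    text: str   # illustrative prompt, not taken from the project


# Hypothetical prompts, one per task.
PROMPTS: List[PromptCase] = [
    PromptCase("tts", "SpeechT5", "The museum opens at nine and closes at five."),
    PromptCase("expressive", "Bark", "Well... [sighs] I did not expect that! [laughs]"),
    PromptCase("music", "MusicGen", "calm acoustic guitar with soft piano, slow tempo, warm mood"),
]


def prompts_for(task: str) -> List[str]:
    """Return the prompt texts registered for one task."""
    return [case.text for case in PROMPTS if case.task == task]
```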

Technical approach

Use SpeechT5 with a vocoder for clear text-to-speech.
Use Bark to test expressive speech controlled by prompt wording.
Use MusicGen to generate short music from instrumentation and mood descriptions.
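
The three steps above can be sketched with the Hugging Face Transformers classes for each model. This is a minimal sketch under stated assumptions, not the project's code: the checkpoint names, the Bark voice preset, and the function names are all assumptions, since the write-up does not say which were used. The heavy imports sit inside each function so that one model's dependencies load only when that function is called.

```python
from typing import Tuple

# Assumed checkpoints; the write-up does not name the ones actually used.
SPEECHT5_TTS = "microsoft/speecht5_tts"
SPEECHT5_VOCODER = "microsoft/speecht5_hifigan"
BARK = "suno/bark-small"
MUSICGEN = "facebook/musicgen-small"

MUSICGEN_FRAME_RATE_HZ = 50  # MusicGen generates about 50 audio tokens per second


def musicgen_tokens_for(seconds: float) -> int:
    """Convert a target clip length into a max_new_tokens budget."""
    return int(round(seconds * MUSICGEN_FRAME_RATE_HZ))


def speecht5_tts(text: str, speaker_embedding) -> Tuple[object, int]:
    """Clear TTS. speaker_embedding is a torch tensor of shape (1, 512) (an x-vector)."""
    from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

    processor = SpeechT5Processor.from_pretrained(SPEECHT5_TTS)
    model = SpeechT5ForTextToSpeech.from_pretrained(SPEECHT5_TTS)
    vocoder = SpeechT5HifiGan.from_pretrained(SPEECHT5_VOCODER)  # HiFi-GAN vocoder
    inputs = processor(text=text, return_tensors="pt")
    # The model predicts a spectrogram; the vocoder turns it into a waveform.
    speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
    return speech.cpu().numpy(), 16000  # SpeechT5 audio is 16 kHz


def bark_expressive(text: str, voice_preset: str = "v2/en_speaker_6") -> Tuple[object, int]:
    """Expressive speech; cues such as [laughs] go directly in the prompt text."""
    from transformers import AutoProcessor, BarkModel

    processor = AutoProcessor.from_pretrained(BARK)
    model = BarkModel.from_pretrained(BARK)
    inputs = processor(text, voice_preset=voice_preset)
    audio = model.generate(**inputs)
    return audio.cpu().numpy().squeeze(), model.generation_config.sample_rate


def musicgen_clip(description: str, seconds: float = 5.0) -> Tuple[object, int]:
    """Short music clip from an instrumentation and mood description."""
    from transformers import AutoProcessor, MusicgenForConditionalGeneration

    processor = AutoProcessor.from_pretrained(MUSICGEN)
    model = MusicgenForConditionalGeneration.from_pretrained(MUSICGEN)
    inputs = processor(text=[description], padding=True, return_tensors="pt")
    audio = model.generate(**inputs, do_sample=True, max_new_tokens=musicgen_tokens_for(seconds))
    # Output shape is (batch, channels, samples); take the first mono clip.
    return audio[0, 0].cpu().numpy(), model.config.audio_encoder.sampling_rate
```

Each function returns a waveform array and its sample rate, so the three outputs can be saved or inspected in the same way even though the models use different sample rates.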

Evaluation and results

Key indicators

3 generative audio tasks
SpeechT5 / Bark / MusicGen
Speech waveform and spectrogram outputs

SpeechT5 worked best for controlled narration.
Bark responded to emotional and style cues in the prompt.
MusicGen demonstrated text-to-audio generation beyond spoken language.
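
To illustrate how waveform and spectrogram outputs relate, the following dependency-free sketch computes basic waveform statistics and a magnitude spectrogram with a plain windowed DFT. It is an illustration only: the function names are hypothetical, and the project may well rely on a numerical or plotting library instead, which would be much faster on real clips.

```python
import cmath
import math
from typing import Dict, List, Sequence


def waveform_stats(samples: Sequence[float], sample_rate: int) -> Dict[str, float]:
    """Duration, peak amplitude, and RMS level of a mono waveform."""
    n = len(samples)
    peak = max((abs(float(s)) for s in samples), default=0.0)
    rms = math.sqrt(sum(float(s) ** 2 for s in samples) / n) if n else 0.0
    return {"seconds": n / sample_rate, "peak": peak, "rms": rms}


def magnitude_spectrogram(
    samples: Sequence[float], frame_size: int = 256, hop: int = 128
) -> List[List[float]]:
    """Return one list of DFT magnitudes (bins 0..frame_size // 2) per frame."""
    # Periodic Hann window to reduce spectral leakage between bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size) for n in range(frame_size)]
    frames: List[List[float]] = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        chunk = [float(samples[start + n]) * window[n] for n in range(frame_size)]
        bins = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            bins.append(abs(acc))
        frames.append(bins)
    return frames
```

Bin `k` corresponds to the frequency `k * sample_rate / frame_size`, so a steady tone shows up as a horizontal line in the spectrogram, while speech shows energy that shifts between bins over time.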

Implementation and code

Implementation focus

The implementation follows one structured workflow, from data preparation through modeling, evaluation, and interpretation, so that the technical decision behind each step stays visible.

Source code

The code is available for exploring the implementation details and extending the experiment when needed.

Scope and responsible use

The project is a focused modeling and evaluation study. Broader use should be supported by validation on additional data, robustness checks, monitoring, and domain-specific evaluation.

Future development

Add listening notes and qualitative comparison tables.
Track latency and memory use across models.
Add longer prompt experiments for music structure.
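
As a starting point for the latency and memory item above, a small context manager could record wall-clock time and peak Python-heap usage per model run. This is a hypothetical sketch, not existing project code: `track` and the log layout are assumptions. Note that `tracemalloc` only sees allocations made through Python's allocator, so tensor memory held by PyTorch is not captured; on a GPU, `torch.cuda.max_memory_allocated()` would be the more relevant figure.

```python
import time
import tracemalloc
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def track(name: str, log: Dict[str, Dict[str, float]]) -> Iterator[None]:
    """Record wall-clock seconds and peak Python-heap bytes for one run.

    Not reentrant: nesting two track() blocks would stop tracing early.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        seconds = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        log[name] = {"seconds": seconds, "peak_python_bytes": float(peak)}
```

Wrapping each model call in `with track("SpeechT5", log): ...` would fill `log` with one entry per model, which can then be turned into a comparison table.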

Technical contribution

The project compares how prompt design controls speech, expression, and music across transformer-based audio models.