Speech Synthesis
General Description
The Speech Synthesis module provides text-to-speech capabilities for the AI-PRISM platform. It offers a REST API to convert text into natural-sounding speech using multiple TTS engines, including KokoroTTS and MeloTTS. The service supports multiple voices, languages, and speed adjustments with results cached for improved performance. It runs as a containerized microservice with configurable parameters and can be deployed using Docker or accessed directly via its FastAPI interface. The component downloads required models automatically and provides audio output as WAV files.
| Resource | Link |
|---|---|
| Source code | https://gitc.piap.lukasiewicz.gov.pl/ai-prism/wp4/human-robot-interaction/speech-synthesis |
| Demo Video | |
Contact
The following table lists contact information for the main developers responsible for the component:
| Name | Email |
|---|---|
| Dorin Clisu | dorin.clisu@nttdata.com |
| Iulia Farcas | iulia.farcas@nttdata.com |
License
Proprietary.
Technical Foundations
Integrated and Open Source Components
Overview
The Speech Synthesis module integrates several open-source components to provide text-to-speech functionality. It uses FastAPI as the web framework for the API endpoints, Gradio Client for remote TTS model interactions, Kokoro-ONNX for efficient inference, Pydantic Settings for configuration management, SoundFile for audio file handling, and Uvicorn as the ASGI server.
Pre-existing Components
FastAPI
Source
FastAPI is a modern, fast web framework for building APIs with Python. https://fastapi.tiangolo.com/
Description
FastAPI is a high-performance web framework for building APIs based on Python type hints. It provides automatic API documentation, request validation, and serialization.
Modifications
None.
Purpose in AI-PRISM
FastAPI is used to create the web API that exposes the text-to-speech functionality, allowing other AI-PRISM components to request speech synthesis through HTTP requests.
License
MIT License - https://github.com/tiangolo/fastapi/blob/master/LICENSE
Gradio Client
Source
Gradio Client is a Python library for interacting with hosted Gradio applications. https://github.com/gradio-app/gradio
Description
Gradio Client allows Python applications to interact with Gradio interfaces programmatically, making API calls to Gradio-hosted models.
Modifications
None.
Purpose in AI-PRISM
In this project, Gradio Client is used to interact with the MeloTTS model service, allowing the speech synthesis component to utilize external TTS capabilities.
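A hypothetical sketch of that interaction is shown below; the service URL, argument order, and `api_name` are assumptions about the deployment, not documented facts:

```python
# Placeholder address and defaults for the remote MeloTTS Gradio app.
MELOTTS_URL = "http://melotts:7860"
MELOTTS_DEFAULTS = {"language": "EN", "speaker": "EN-Default", "speed": 1.0}


def melotts_synthesize(text: str, **overrides) -> str:
    """Return the local path of the WAV file downloaded by gradio_client."""
    # Imported lazily: Client() connects to the service on construction.
    from gradio_client import Client

    params = {**MELOTTS_DEFAULTS, **overrides}
    client = Client(MELOTTS_URL)  # fetches the Gradio app's API config
    return client.predict(
        text,
        params["speaker"],
        params["speed"],
        params["language"],
        api_name="/synthesize",  # hypothetical endpoint name
    )
```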
License
Apache License 2.0 - https://github.com/gradio-app/gradio/blob/main/LICENSE
Kokoro-ONNX
Source
Kokoro-ONNX is a library for text-to-speech using ONNX models. https://github.com/thewh1teagle/kokoro-onnx
Description
Kokoro-ONNX provides optimized inference for TTS models using the ONNX runtime, offering efficient voice synthesis from text.
Modifications
None.
Purpose in AI-PRISM
Kokoro-ONNX is the primary TTS engine in the speech synthesis component, providing efficient text-to-speech conversion with multiple voices and adjustable speed.
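A sketch of Kokoro-ONNX inference is given below; the model and voices file names are illustrative, since the component downloads the real files automatically:

```python
def kokoro_synthesize(text: str, voice: str = "af_heart", speed: float = 1.0,
                      model_path: str = "kokoro.onnx",
                      voices_path: str = "voices.bin"):
    """Synthesize speech with kokoro-onnx and return (samples, sample_rate)."""
    # Imported lazily: constructing Kokoro requires the model files on disk.
    from kokoro_onnx import Kokoro

    kokoro = Kokoro(model_path, voices_path)
    # create() returns the raw audio samples and their sample rate.
    samples, sample_rate = kokoro.create(text, voice=voice, speed=speed)
    return samples, sample_rate
```

The returned samples can then be written to a WAV file (see SoundFile below) before being served to the client.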
License
MIT License - https://github.com/thewh1teagle/kokoro-onnx/blob/main/LICENSE
Pydantic Settings
Source
Pydantic Settings is a library for settings management using Pydantic models. https://docs.pydantic.dev/latest/usage/pydantic_settings/
Description
Pydantic Settings extends Pydantic to provide settings management with support for environment variables, configuration files, and secrets.
Modifications
None.
Purpose in AI-PRISM
Pydantic Settings is used to manage the configuration of the speech synthesis component, handling environment variables and defaults for server settings and model paths.
License
MIT License - https://github.com/pydantic/pydantic/blob/main/LICENSE
SoundFile
Source
SoundFile is a library for reading and writing sound files. https://github.com/bastibe/python-soundfile
Description
SoundFile provides simple audio file I/O with support for various formats, built on the libsndfile C library.
Modifications
None.
Purpose in AI-PRISM
SoundFile is used to write generated audio data to WAV files that are served to clients requesting speech synthesis.
License
BSD 3-Clause License - https://github.com/bastibe/python-soundfile/blob/master/LICENSE
Uvicorn
Source
Uvicorn is an ASGI web server implementation. https://www.uvicorn.org/
Description
Uvicorn is a lightning-fast ASGI server implementation, using uvloop and httptools for optimal performance.
Modifications
None.
Purpose in AI-PRISM
Uvicorn serves as the web server for the FastAPI application, handling HTTP requests to the speech synthesis API endpoints.
License
BSD 3-Clause License - https://github.com/encode/uvicorn/blob/master/LICENSE.md
How to install
Every AI-PRISM component is installed using the Cluster management service. During the installation process, the user needs to configure a set of high-level parameters.
How to use
To use the Speech Synthesis service, you can make HTTP requests to the /synthesize endpoint with the following parameters:
- text: The text to convert to speech (required)
- model: TTS model to use (default: 'kokorotts', options: 'kokorotts', 'melotts')
- speed: Speech rate multiplier (default: 1.0)
- voice: Voice ID for KokoroTTS (default: 'af_heart')
- language: Language code for MeloTTS (default: 'EN')
- speaker: Speaker ID for MeloTTS (default: 'EN-Default')
- cache: Whether to cache results (default: true)
The response will be a WAV audio file containing the synthesized speech.
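Assuming the endpoint accepts these parameters as a query string (the base URL below is a placeholder for wherever the service is deployed), a request could be built like this:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # placeholder deployment address

params = {
    "text": "Welcome to the AI-PRISM assembly station.",
    "model": "kokorotts",
    "voice": "af_heart",
    "speed": 1.2,
    "cache": "true",
}
url = f"{BASE_URL}/synthesize?{urlencode(params)}"

# Against a running instance, the response body is the WAV file:
# with urlopen(url) as resp, open("speech.wav", "wb") as f:
#     f.write(resp.read())
```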