Speech Synthesis

General Description

The Speech Synthesis module provides text-to-speech capabilities for the AI-PRISM platform. It offers a REST API that converts text into natural-sounding speech using multiple TTS engines, including KokoroTTS and MeloTTS. The service supports multiple voices, languages, and speed adjustment, and caches synthesis results for improved performance. It runs as a containerized microservice with configurable parameters and can be deployed using Docker or accessed directly via its FastAPI interface. The component downloads the required models automatically and returns audio output as WAV files.

Resource Link
Source code: https://gitc.piap.lukasiewicz.gov.pl/ai-prism/wp4/human-robot-interaction/speech-synthesis
Demo Video

Contact

The following table lists contact information for the main developers in charge of the component:

Name           Email                       Organisation
Dorin Clisu    dorin.clisu@nttdata.com     NTT Data Romania
Iulia Farcas   iulia.farcas@nttdata.com    NTT Data Romania

License

Proprietary.

Technical Foundations

Integrated and Open Source Components

Overview

The Speech Synthesis module integrates several open-source components to provide text-to-speech functionality. It uses FastAPI as the web framework for the API endpoints, Gradio Client for remote TTS model interactions, Kokoro-ONNX for efficient inference, Pydantic Settings for configuration management, SoundFile for audio file handling, and Uvicorn as the ASGI server.

Pre-existing Components

FastAPI

Source

FastAPI is a modern, fast web framework for building APIs with Python. https://fastapi.tiangolo.com/

Description

FastAPI is a high-performance web framework for building APIs based on Python type hints. It provides automatic API documentation, request validation, and serialization.

Modifications

None.

Purpose in AI-PRISM

FastAPI is used to create the web API that exposes the text-to-speech functionality, allowing other AI-PRISM components to request speech synthesis through HTTP requests.
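
A minimal sketch of how such an endpoint could be declared is shown below. The route and parameter names follow the /synthesize endpoint described under "How to use"; the run_tts() helper is hypothetical and stands in for the component's actual TTS logic.

    from fastapi import FastAPI
    from fastapi.responses import FileResponse

    app = FastAPI()

    def run_tts(text: str, model: str, voice: str, speed: float) -> str:
        """Hypothetical stand-in for the actual TTS engines; returns the path of a generated WAV file."""
        raise NotImplementedError

    @app.get("/synthesize")
    def synthesize(text: str, model: str = "kokorotts", voice: str = "af_heart", speed: float = 1.0):
        # Run the selected engine and return the resulting WAV file to the caller.
        wav_path = run_tts(text, model, voice, speed)
        return FileResponse(wav_path, media_type="audio/wav")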

License

MIT License - https://github.com/tiangolo/fastapi/blob/master/LICENSE

Gradio Client

Source

Gradio Client is a Python library for interacting with hosted Gradio applications. https://github.com/gradio-app/gradio

Description

Gradio Client allows Python applications to interact with Gradio interfaces programmatically, making API calls to Gradio-hosted models.

Modifications

None.

Purpose in AI-PRISM

In this project, Gradio Client is used to interact with the MeloTTS model service, allowing the speech synthesis component to utilize external TTS capabilities.
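
The sketch below illustrates how such a call could look with gradio_client; the MeloTTS service address, api_name, and argument order are assumptions used for illustration, not the component's actual configuration.

    from gradio_client import Client

    # Connect to the (assumed) MeloTTS Gradio service.
    client = Client("http://melotts-host:7860/")

    # Request synthesis; the return value is a local path to the generated audio file.
    wav_path = client.predict(
        "Hello from AI-PRISM",   # text to synthesize
        "EN-Default",            # speaker ID (assumed argument order)
        1.0,                     # speed
        "EN",                    # language code
        api_name="/synthesize",  # assumed endpoint name
    )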

License

Apache License 2.0 - https://github.com/gradio-app/gradio/blob/main/LICENSE

Kokoro-ONNX

Source

Kokoro-ONNX is a library for text-to-speech using ONNX models. https://github.com/thewh1teagle/kokoro-onnx

Description

Kokoro-ONNX provides optimized inference for TTS models using the ONNX runtime, offering efficient voice synthesis from text.

Modifications

None.

Purpose in AI-PRISM

Kokoro-ONNX is the primary TTS engine in the speech synthesis component, providing efficient text-to-speech conversion with multiple voices and adjustable speed.
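
A minimal sketch of direct kokoro-onnx usage follows; the model and voice file names are assumptions (the component downloads the required files automatically).

    import soundfile as sf
    from kokoro_onnx import Kokoro

    # Load the ONNX model and voice embeddings (file names assumed).
    kokoro = Kokoro("kokoro-v0_19.onnx", "voices.json")

    # Generate audio samples for the given text, voice, and speed.
    samples, sample_rate = kokoro.create(
        "Welcome to the AI-PRISM platform.",
        voice="af_heart",
        speed=1.0,
    )
    sf.write("output.wav", samples, sample_rate)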

License

MIT License - https://github.com/thewh1teagle/kokoro-onnx/blob/main/LICENSE

Pydantic Settings

Source

Pydantic Settings is a library for settings management using Pydantic models. https://docs.pydantic.dev/latest/usage/pydantic_settings/

Description

Pydantic Settings extends Pydantic to provide settings management with support for environment variables, configuration files, and secrets.

Modifications

None.

Purpose in AI-PRISM

Pydantic Settings is used to manage the configuration of the speech synthesis component, handling environment variables and defaults for server settings and model paths.
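
A minimal sketch of this pattern is shown below; the field names and defaults are illustrative assumptions rather than the component's actual settings schema.

    from pydantic_settings import BaseSettings

    class Settings(BaseSettings):
        # Each field can be overridden by an environment variable of the same name
        # (e.g. HOST, PORT, MODEL_PATH).
        host: str = "0.0.0.0"
        port: int = 8000
        model_path: str = "models/kokoro-v0_19.onnx"
        cache_dir: str = "cache"

    settings = Settings()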

License

MIT License - https://github.com/pydantic/pydantic/blob/main/LICENSE

SoundFile

Source

SoundFile is a library for reading and writing sound files. https://github.com/bastibe/python-soundfile

Description

SoundFile provides simple audio file I/O with support for various formats, built on the libsndfile C library.

Modifications

None.

Purpose in AI-PRISM

SoundFile is used to write generated audio data to WAV files that are served to clients requesting speech synthesis.
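
The sketch below shows the write step in isolation; the sine-tone samples stand in for synthesized audio, and the 24 kHz sample rate is an assumption.

    import numpy as np
    import soundfile as sf

    sample_rate = 24000                            # assumed output sample rate
    t = np.linspace(0, 1.0, sample_rate, endpoint=False)
    samples = 0.2 * np.sin(2 * np.pi * 440 * t)    # one second of a 440 Hz tone
    sf.write("speech.wav", samples, sample_rate)   # written as a PCM WAV file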

License

BSD 3-Clause License - https://github.com/bastibe/python-soundfile/blob/master/LICENSE

Uvicorn

Source

Uvicorn is an ASGI web server implementation. https://www.uvicorn.org/

Description

Uvicorn is a lightning-fast ASGI server implementation, using uvloop and httptools for optimal performance.

Modifications

None.

Purpose in AI-PRISM

Uvicorn serves as the web server for the FastAPI application, handling HTTP requests to the speech synthesis API endpoints.
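
A minimal sketch of launching the application programmatically is shown below; the module path "main:app" and the port are assumptions.

    import uvicorn

    if __name__ == "__main__":
        # Serve the FastAPI application on all interfaces.
        uvicorn.run("main:app", host="0.0.0.0", port=8000)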

License

BSD 3-Clause License - https://github.com/encode/uvicorn/blob/master/LICENSE.md

How to install

Every AI-PRISM component is installed using the Cluster management service. During the installation process, the user needs to configure a set of high-level parameters.

How to use

To use the Speech Synthesis service, you can make HTTP requests to the /synthesize endpoint with the following parameters:

- text: The text to convert to speech (required)
- model: TTS model to use (default: 'kokorotts'; options: 'kokorotts', 'melotts')
- speed: Speech rate multiplier (default: 1.0)
- voice: Voice ID for KokoroTTS (default: 'af_heart')
- language: Language code for MeloTTS (default: 'EN')
- speaker: Speaker ID for MeloTTS (default: 'EN-Default')
- cache: Whether to cache results (default: true)

The response will be a WAV audio file containing the synthesized speech.
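
The example below sketches a client call using Python requests; the host/port and the use of query parameters on a GET request are assumptions, so the exact request shape should be confirmed against the component's OpenAPI documentation (served by FastAPI at /docs).

    import requests

    resp = requests.get(
        "http://localhost:8000/synthesize",        # assumed service address
        params={
            "text": "Hello from AI-PRISM",
            "model": "kokorotts",
            "voice": "af_heart",
            "speed": 1.0,
            "cache": "true",
        },
        timeout=60,
    )
    resp.raise_for_status()

    # Save the returned WAV audio to disk.
    with open("speech.wav", "wb") as f:
        f.write(resp.content)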