Speech Recognition

General Description

This component provides automatic speech recognition (ASR) using OpenAI's Whisper model and the Silero voice activity detection (VAD) model. It receives audio streams over ZMQ, uses the VAD to detect when someone is speaking, and transcribes the spoken content to text. A FastAPI server exposes endpoints for controlling recording and retrieving transcription results. Several Whisper model sizes are supported; base.en is recommended as a balance between speed and accuracy. The VAD parameters are configurable so that detection can be adapted to different acoustic environments. The component is designed to run in real time on embedded platforms such as the Raspberry Pi 5.
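How configurable VAD parameters turn a stream of frames into utterances can be pictured with a small state machine. The following is a minimal sketch, not the component's actual implementation: the parameter names (threshold, min_speech_frames, min_silence_frames) and the function find_utterances are hypothetical, and the per-frame speech probabilities are assumed to come from a VAD model such as Silero VAD.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class VadConfig:
    # All three parameters are illustrative assumptions, not the component's real settings.
    threshold: float = 0.5        # probability at or above which a frame counts as speech
    min_speech_frames: int = 3    # consecutive speech frames needed to open an utterance
    min_silence_frames: int = 5   # consecutive silent frames needed to close it


def find_utterances(probs: Iterable[float], cfg: VadConfig) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, end exclusive, of detected utterances."""
    utterances = []
    in_speech = False
    start = 0
    speech_run = 0   # length of the current run of speech frames
    silence_run = 0  # length of the current run of silent frames
    count = 0
    for i, p in enumerate(probs):
        count = i + 1
        if p >= cfg.threshold:
            speech_run += 1
            silence_run = 0
            if not in_speech and speech_run >= cfg.min_speech_frames:
                in_speech = True
                start = i - speech_run + 1
        else:
            silence_run += 1
            speech_run = 0
            if in_speech and silence_run >= cfg.min_silence_frames:
                in_speech = False
                utterances.append((start, i - silence_run + 1))
    if in_speech:
        # The stream ended mid-utterance: close it at the last speech frame.
        utterances.append((start, count - silence_run))
    return utterances
```

Raising min_silence_frames makes the detector tolerate longer pauses inside one utterance, while raising threshold makes it less sensitive to background noise; both are the kind of trade-off that tuning for a given acoustic environment involves.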

Resource Link
Source code https://gitc.piap.lukasiewicz.gov.pl/ai-prism/wp4/ai-based-perception-modules/speech-recognition
Demo Video

Contact

The following table lists contact information for the main developers of the component:

Name Email Organisation
Dorin Clisu dorin.clisu@nttdata.com NTT Data Romania

License

Proprietary.

Technical Foundations

Integrated and Open Source Components

Overview

The Speech Recognition module integrates several open-source components to provide its functionality, including AnyIO for asynchronous IO operations, FastAPI for the REST interface, faster-whisper for speech recognition, SSE-Starlette for server-sent events, and Uvicorn as an ASGI server. These components work together to create a robust speech recognition system capable of handling real-time audio streams, detecting voice activity, and providing transcription results through a web API.

Pre-existing Components

FastAPI

Source

FastAPI is an open-source web framework for building APIs with Python. https://fastapi.tiangolo.com/

Description

FastAPI is a modern, high-performance web framework for building APIs with Python, based on standard Python type hints. It validates request data against the declared types and generates interactive API documentation from an OpenAPI schema.

Modifications

None.

Purpose in AI-PRISM

FastAPI provides the HTTP interface for controlling the speech recognition system, allowing clients to start recordings, receive voice activity status updates, and get transcription results.

License

FastAPI is licensed under the MIT License. https://github.com/tiangolo/fastapi/blob/master/LICENSE

Faster-Whisper

Source

Faster Whisper is an open-source speech recognition library. https://github.com/guillaumekln/faster-whisper

Description

Faster Whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which provides faster inference performance compared to the original implementation.

Modifications

None.

Purpose in AI-PRISM

Faster Whisper provides the core speech recognition functionality, converting audio signals to text transcripts.
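As an illustration of the library's basic usage, the sketch below transcribes an audio file with faster-whisper and joins the resulting segments into one transcript. It is an assumption about how such a call could look, not the component's code: the helper names join_segments and transcribe_file, the CPU device, the int8 compute type, and the beam size are all chosen for the example.

```python
from types import SimpleNamespace
from typing import Iterable


def join_segments(segments: Iterable) -> str:
    """Concatenate the text of transcription segments, skipping empty ones."""
    return " ".join(seg.text.strip() for seg in segments if seg.text.strip())


def transcribe_file(path: str, model_size: str = "base.en") -> str:
    """Transcribe an audio file on CPU (requires faster-whisper to be installed)."""
    # Imported lazily so join_segments stays usable without the library.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns a lazy generator of segments plus metadata about the audio.
    segments, _info = model.transcribe(path, beam_size=5)
    return join_segments(segments)


# Stand-in segments, so the joining step can be tried without a model download.
fake_segments = [
    SimpleNamespace(text=" Hello there."),
    SimpleNamespace(text=" "),
    SimpleNamespace(text=" How are you?"),
]
print(join_segments(fake_segments))
```

Because the segments are produced lazily, the transcription work happens while join_segments iterates, not when transcribe() is called.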

License

Faster Whisper is licensed under the MIT License. https://github.com/guillaumekln/faster-whisper/blob/master/LICENSE

AnyIO

Source

AnyIO is an open-source asynchronous I/O library. https://github.com/agronholm/anyio

Description

AnyIO is a high-level asynchronous concurrency library that works on top of either asyncio or trio, providing a consistent API regardless of the backend.

Modifications

None.

Purpose in AI-PRISM

AnyIO provides the asynchronous capabilities needed for handling concurrent operations like receiving audio streams while processing existing data.
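The pattern of receiving audio while processing earlier data can be sketched as a producer and a consumer sharing a bounded queue. To keep the example free of third-party dependencies it uses the standard library's asyncio, one of the backends AnyIO runs on, instead of AnyIO's own API; the function names and the byte-counting "processing" step are placeholders, not the component's code.

```python
import asyncio
from typing import List


async def receive_audio(queue: asyncio.Queue, chunks: List[bytes]) -> None:
    """Stand-in for the audio receiver: push chunks as they 'arrive'."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control, as a real socket read would
        await queue.put(chunk)
    await queue.put(None)  # sentinel: the stream has finished


async def process_audio(queue: asyncio.Queue) -> int:
    """Stand-in for VAD and transcription: consume chunks and count the bytes seen."""
    total = 0
    while True:
        chunk = await queue.get()
        if chunk is None:
            return total
        total += len(chunk)


async def run_pipeline(chunks: List[bytes]) -> int:
    # A bounded queue makes the receiver wait when the processor falls behind.
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)
    _, total = await asyncio.gather(
        receive_audio(queue, chunks),
        process_audio(queue),
    )
    return total
```

Calling asyncio.run(run_pipeline(chunks)) runs both tasks concurrently on one event loop, so reception continues while earlier chunks are still being handled.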

License

AnyIO is licensed under the MIT License. https://github.com/agronholm/anyio/blob/master/LICENSE

SSE-Starlette

Source

SSE-Starlette is an open-source library for Server-Sent Events. https://github.com/sysid/sse-starlette

Description

SSE-Starlette provides Server-Sent Events (SSE) support for Starlette and FastAPI applications, enabling real-time event streaming from server to client.

Modifications

None.

Purpose in AI-PRISM

SSE-Starlette is used to stream VAD status updates to clients, allowing them to monitor voice activity in real time.
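For readers unfamiliar with Server-Sent Events, the sketch below shows what a single event looks like on the wire. SSE-Starlette performs this serialisation itself; the function format_sse, the event name "vad", and the "speaking" field are hypothetical and only illustrate the text/event-stream format.

```python
import json
from typing import Optional


def format_sse(data: dict, event: Optional[str] = None) -> str:
    """Serialise one Server-Sent Event in the text/event-stream wire format."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # The payload goes on a 'data:' line; a blank line terminates the event.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


print(format_sse({"speaking": True}, event="vad"), end="")
```

Each event is plain text over a long-lived HTTP response, which is why a browser's EventSource or any streaming HTTP client can consume it without extra protocol support.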

License

SSE-Starlette is licensed under the MIT License. https://github.com/sysid/sse-starlette/blob/master/LICENSE

Uvicorn

Source

Uvicorn is an open-source ASGI server. https://www.uvicorn.org/

Description

Uvicorn is a fast ASGI server implementation that can use uvloop and httptools, when they are installed, for better performance.

Modifications

None.

Purpose in AI-PRISM

Uvicorn serves as the ASGI server that hosts the FastAPI application, handling HTTP requests and responses efficiently.
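What "hosting an ASGI application" means can be shown with the smallest possible ASGI app, which is just an async callable. This is an illustrative sketch only: the real component serves a FastAPI application, and the names app, call_app_directly, and serve, as well as the host and port, are assumptions made for the example.

```python
import asyncio
from typing import List


async def app(scope, receive, send):
    """Minimal ASGI application: answer every HTTP request with 'ok'."""
    if scope["type"] != "http":
        return
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"ok"})


def call_app_directly() -> List[dict]:
    """Drive the app the way a server would, without opening a socket."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/"}
    asyncio.run(app(scope, receive, send))
    return sent


def serve() -> None:
    """Host the app with Uvicorn (requires uvicorn; host and port are illustrative)."""
    import uvicorn

    uvicorn.run(app, host="127.0.0.1", port=8000)
```

A FastAPI instance is itself such a callable, so passing it to uvicorn.run in place of app works the same way.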

License

Uvicorn is licensed under the BSD 3-Clause License. https://github.com/encode/uvicorn/blob/master/LICENSE.md

How to install

Every AI-PRISM component is installed using the Cluster management service. During the installation process, the user needs to configure a set of high-level parameters.

How to use

The Speech Recognition component is used through its API endpoints. Interactive API documentation is available at http://localhost/docs in a browser.

The main endpoints include:
- /start-vad - Start voice activity detection and wait for a complete utterance.
- /status-vad - Stream real-time VAD status updates.
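A client could consume the /status-vad stream roughly as follows. This is a sketch under stated assumptions, not a documented client: it assumes the endpoint answers a plain GET request with Server-Sent Events whose data lines carry JSON, and the "speaking" field in the sample is hypothetical. The helper names parse_sse and stream_vad_status are introduced here for illustration; check the API documentation for the actual methods and payloads.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield the JSON payload of each Server-Sent Event in a stream of byte lines."""
    data_parts = []
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if line.startswith("data:"):
            data_parts.append(line[len("data:"):].strip())
        elif line == "" and data_parts:
            # A blank line ends the event.
            yield json.loads("\n".join(data_parts))
            data_parts = []
        # Other fields and comment lines (starting with ':') are ignored.


def stream_vad_status(base_url: str = "http://localhost") -> Iterator[dict]:
    """Connect to /status-vad and yield status updates as they arrive."""
    # Assumption: the endpoint accepts GET and streams JSON payloads.
    with urllib.request.urlopen(f"{base_url}/status-vad") as response:
        yield from parse_sse(response)


# Offline example with a captured-looking stream; the field name is hypothetical.
sample = [b'data: {"speaking": true}\n', b"\n", b": ping\n", b"\n"]
print(list(parse_sse(sample)))
```

Keeping the parsing separate from the network call makes the client easy to test against recorded streams before pointing it at a running component.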