Speech Recognition
General Description
This component provides automatic speech recognition (ASR) using OpenAI's Whisper model and the Silero voice activity detector (VAD). It receives audio streams over ZMQ, uses the VAD to detect when someone is speaking, and transcribes the spoken content to text. A FastAPI server exposes endpoints for controlling recording and retrieving transcription results. Several Whisper model sizes are supported; base.en is recommended as a balance between speed and accuracy. The VAD parameters are configurable so that detection can be adapted to different acoustic environments. The component is designed to run in real time on embedded platforms such as the Raspberry Pi 5.
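To illustrate how configurable VAD parameters can adapt detection to different acoustic environments, the following is a minimal sketch and not the component's actual implementation. It assumes the VAD yields one speech probability per audio frame (as Silero VAD does per chunk); the names `VadConfig`, `threshold`, `min_speech_frames`, and `min_silence_frames` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class VadConfig:
    # Hypothetical parameter names and defaults, chosen for illustration only.
    threshold: float = 0.5       # probability at or above which a frame counts as speech
    min_speech_frames: int = 3   # consecutive speech frames needed to open an utterance
    min_silence_frames: int = 5  # consecutive silent frames needed to close it


def find_utterances(probs: Iterable[float], cfg: VadConfig) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, end exclusive, for each detected utterance."""
    utterances: List[Tuple[int, int]] = []
    start = None      # start index of the currently open utterance, if any
    speech_run = 0    # length of the current run of speech frames
    silence_run = 0   # length of the current run of silent frames
    count = 0         # total number of frames seen so far
    for i, p in enumerate(probs):
        count = i + 1
        if p >= cfg.threshold:
            speech_run += 1
            silence_run = 0
            if start is None and speech_run >= cfg.min_speech_frames:
                # Open the utterance at the first frame of this speech run.
                start = i - speech_run + 1
        else:
            silence_run += 1
            speech_run = 0
            if start is not None and silence_run >= cfg.min_silence_frames:
                # Close the utterance at the first frame of this silence run.
                utterances.append((start, i - silence_run + 1))
                start = None
    if start is not None:
        # The stream ended mid-utterance: close it after the last speech frame.
        utterances.append((start, count - silence_run))
    return utterances
```

In a noisy room, raising `threshold` or `min_speech_frames` makes the detector less likely to trigger on background sound, while a larger `min_silence_frames` tolerates longer pauses inside a single utterance.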
| Resource | Link |
|---|---|
| Source code | https://gitc.piap.lukasiewicz.gov.pl/ai-prism/wp4/ai-based-perception-modules/speech-recognition |
| Demo Video | |
Contact
The following table lists contact information for the main developers of the component:
| Name | Organisation | Email |
|---|---|---|
| Dorin Clisu | | dorin.clisu@nttdata.com |
License
Proprietary.
Technical Foundations
Integrated and Open Source Components
Overview
The Speech Recognition module integrates several open-source components: AnyIO for asynchronous I/O, FastAPI for the REST interface, faster-whisper for speech recognition, SSE-Starlette for server-sent events, and Uvicorn as the ASGI server. Together they handle real-time audio streams, detect voice activity, and expose transcription results through a web API.
Pre-existing Components
FastAPI
Source
FastAPI is an open-source web framework for building APIs with Python. https://fastapi.tiangolo.com/
Description
FastAPI is a modern, high-performance web framework for building APIs with Python 3.7+ based on standard Python type hints. It provides automatic API documentation and validation based on OpenAPI standards.
Modifications
None.
Purpose in AI-PRISM
FastAPI provides the HTTP interface for controlling the speech recognition system, allowing clients to start recordings, receive voice activity status updates, and get transcription results.
License
FastAPI is licensed under the MIT License. https://github.com/tiangolo/fastapi/blob/master/LICENSE
Faster-Whisper
Source
Faster Whisper is an open-source speech recognition library. https://github.com/guillaumekln/faster-whisper
Description
Faster Whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which gives faster inference than the original implementation.
Modifications
None.
Purpose in AI-PRISM
Faster Whisper provides the core speech recognition functionality, converting audio signals to text transcripts.
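As a rough illustration of this step, the sketch below transcribes one audio file with the faster-whisper `WhisperModel` class. It is not the component's code: the CPU device, the `int8` compute type, the beam size, and the helper names `join_segments` and `transcribe` are assumptions made for the example.

```python
from typing import Iterable


def join_segments(segments: Iterable) -> str:
    """Concatenate the text of each segment into one transcript string."""
    texts = (seg.text.strip() for seg in segments)
    return " ".join(t for t in texts if t)


def transcribe(audio_path: str, model_size: str = "base.en") -> str:
    """Transcribe an audio file and return the full text."""
    # Imported lazily so this module still loads where faster-whisper is absent.
    from faster_whisper import WhisperModel

    # Assumption: int8 on CPU, a common choice for boards without a GPU.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path, language="en", beam_size=5)
    # `segments` is a lazy generator: decoding runs while it is consumed.
    return join_segments(segments)
```

Calling `transcribe("utterance.wav")` (a hypothetical file name) would load the model and return the transcript. A long-running service would normally create the model once and reuse it across utterances rather than reloading it per call.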
License
Faster Whisper is licensed under the MIT License. https://github.com/guillaumekln/faster-whisper/blob/master/LICENSE
AnyIO
Source
AnyIO is an open-source asynchronous I/O library. https://github.com/agronholm/anyio
Description
AnyIO is a high-level asynchronous concurrency library that works on top of either asyncio or trio, providing a consistent API regardless of the backend.
Modifications
None.
Purpose in AI-PRISM
AnyIO provides the asynchronous capabilities needed for handling concurrent operations like receiving audio streams while processing existing data.
License
AnyIO is licensed under the MIT License. https://github.com/agronholm/anyio/blob/master/LICENSE
SSE-Starlette
Source
SSE-Starlette is an open-source library for Server-Sent Events. https://github.com/sysid/sse-starlette
Description
SSE-Starlette provides Server-Sent Events (SSE) support for Starlette and FastAPI applications, enabling real-time event streaming from server to client.
Modifications
None.
Purpose in AI-PRISM
SSE-Starlette is used to stream VAD status updates to clients, allowing them to monitor voice activity in real time.
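On the client side, a Server-Sent Events stream is plain text: `field: value` lines grouped into events that are separated by blank lines. The following minimal parser sketches how a client might decode such a stream. The function name `parse_sse` is hypothetical, and the payload format of the VAD status events is not specified here, so the example makes no assumption about what the `data` field contains.

```python
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per event from an iterable of SSE text lines."""
    fields: Dict[str, str] = {}
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line ends the current event; events without data are skipped.
            if data:
                yield {**fields, "data": "\n".join(data)}
            fields, data = {}, []
            continue
        if line.startswith(":"):
            continue  # comment line, typically a keep-alive ping
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is not part of the value
        if name == "data":
            data.append(value)  # multiple data lines are joined with newlines
        else:
            fields[name] = value
```

The parser accepts any iterable of lines, so it can be fed from a decoded HTTP response body as well as from a list of strings in a test.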
License
SSE-Starlette is licensed under the MIT License. https://github.com/sysid/sse-starlette/blob/master/LICENSE
Uvicorn
Source
Uvicorn is an open-source ASGI server. https://www.uvicorn.org/
Description
Uvicorn is a fast ASGI server implementation that can use uvloop and httptools for better performance.
Modifications
None.
Purpose in AI-PRISM
Uvicorn serves as the ASGI server that hosts the FastAPI application, handling HTTP requests and responses efficiently.
License
Uvicorn is licensed under the BSD 3-Clause License. https://github.com/encode/uvicorn/blob/master/LICENSE.md
How to install
Every AI-PRISM component is installed using the Cluster management service. During the installation process, the user needs to configure a set of high-level parameters.
How to use
The Speech Recognition component is used through its API endpoints. The interactive API documentation is available at http://localhost/docs in a browser.
The main endpoints include:
- /start-vad - Start voice activity detection and wait for a complete utterance.
- /status-vad - Stream real-time VAD status updates.
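The endpoints above could be called from a small client such as the sketch below, which uses only the Python standard library. The HTTP method and the response schema are not stated in this document, so the example assumes that /start-vad accepts a plain GET and replies with JSON; the page at /docs is the authoritative reference. The helper names `endpoint_url` and `start_vad` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urljoin


def endpoint_url(base_url: str, path: str) -> str:
    """Join the service base URL and an endpoint path."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def start_vad(base_url: str = "http://localhost", timeout: float = 60.0) -> dict:
    """Call /start-vad and return the decoded reply once an utterance completes."""
    # Assumption: a plain GET with a JSON response body. Check /docs for the real contract.
    url = endpoint_url(base_url, "/start-vad")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Because /start-vad waits for a complete utterance, the timeout should be generous. The /status-vad stream can be consumed in parallel from a separate connection to display live voice activity.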