Realtime Speech-to-Text

OvenMediaEngine (OME) version 0.20.0 and later supports real-time automatic subtitles through integration with whisper.cpp. This feature converts live audio streams to text in real time and can optionally translate the recognized speech into English.

An NVIDIA GPU is required. CPU inference is not supported because it is too slow for real-time live transcription.

Prerequisites

NVIDIA GPU and Driver

Check your GPU and driver status using:

$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.288.01             Driver Version: 535.288.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1050        On  | 00000000:3B:00.0  On |                  N/A |
| 20%   39C    P8              N/A /  75W |    204MiB /  2048MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 4000 SFF Ada ...    On  | 00000000:AF:00.0 Off |                  Off |
| 30%   38C    P8               5W /  70W |    171MiB / 20475MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    802940      C   ...prise/src/bin/DEBUG/OvenMediaEngine       40MiB |
|    1   N/A  N/A    802940      C   ...prise/src/bin/DEBUG/OvenMediaEngine      158MiB |
+---------------------------------------------------------------------------------------+

If a driver is not installed, download it from the NVIDIA website or use the helper script provided in the OME repository.

Official driver: https://www.nvidia.com/en-us/drivers/

OME install script: https://github.com/OvenMediaLabs/OvenMediaEngine/blob/master/misc/install_nvidia_driver.sh

warning

This script installs the versions recommended by OME. If you want to install the latest version, change the parameters.

CUDA Toolkit

CUDA Toolkit is required to build whisper.cpp with GPU acceleration.

Download from: https://developer.nvidia.com/cuda-downloads
Use a version that matches your GPU generation.
- Recommended CUDA Toolkit : v12.0 ~ v12.8

Build and Install whisper.cpp

Install the prerequisites (including whisper.cpp built with GPU support) from the OME source root:

$ cmake -DOME_HWACCEL_NVIDIA=ON -P cmake/InstallPrerequisites.cmake

Configuration

STT configuration is split across two sections:

<Modules><Whisper> in Server.xml — preloads model files into GPU memory at startup.
<Application><Subtitles> — defines subtitle renditions (label, language, etc.) that STT output will be written to.
<Application><OutputProfiles><MediaOptions><STT> — connects an input audio track to a subtitle rendition via an STT engine.

warning

Breaking change: The <Transcription> element inside <Subtitles><Rendition> has been removed. If your existing configuration uses <Subtitles><Rendition><Transcription>, it will no longer work. Please migrate to <OutputProfiles><MediaOptions><STT><Rendition> as described below.

Step 1: Preload Models (Server.xml)

Declare the Whisper model files to load at server startup inside <Modules><Whisper>. Multiple <PreloadModel> entries are allowed. Models are loaded in descending file-size order to maximize GPU utilization.

Each <PreloadModel> entry has the following fields:

Key	Description
Path	Path to the model file. Can be absolute or relative to the config directory.
Devices	Comma-separated list of OME device indices to load the model onto (e.g. `0`, `0,1`, `2`), the same numbering as `<Video><Modules>nv:N</Modules>`. Set to `all` to load on every available GPU. If omitted, defaults to device 0.

<Server>
    <Modules>
        <Whisper>
            <!-- Load on GPU 0 (default when Devices is omitted) -->
            <PreloadModel>
                <Path>whisper_model/ggml-small.bin</Path>
            </PreloadModel>

            <!-- Load on all available GPUs -->
            <PreloadModel>
                <Path>whisper_model/ggml-medium.bin</Path>
                <Devices>all</Devices>
            </PreloadModel>

            <!-- Load on GPU 0 and GPU 1 -->
            <PreloadModel>
                <Path>whisper_model/ggml-large.bin</Path>
                <Devices>0,1</Devices>
            </PreloadModel>
        </Whisper>
    </Modules>
</Server>

info

<PreloadModel> is optional. If omitted, models are loaded on demand when the first stream that uses them is published. Preloading is recommended for production to avoid a delay on the first stream.

Step 2: Define Subtitle Renditions

Define the subtitle tracks that will receive STT output. For more details on <Subtitles>, refer to the Subtitles section.

<Application>
    <Subtitles>
        <Enable>true</Enable>
        <DefaultLabel>Korean</DefaultLabel>
        <Rendition>
            <Language>ko</Language>
            <Label>Korean</Label>
            <AutoSelect>true</AutoSelect>
            <Forced>false</Forced>
        </Rendition>
        <Rendition>
            <Language>en</Language>
            <Label>English</Label>
        </Rendition>
    </Subtitles>
</Application>

Step 3: Configure STT in OutputProfiles

Under <OutputProfiles><MediaOptions><STT>, add a <Rendition> for each audio-to-subtitle mapping. The <OutputSubtitleLabel> must match a <Label> defined in <Subtitles>.

<Application>
    <OutputProfiles>
        <MediaOptions>
            <STT>
                <!-- Korean STT on GPU 0 -->
                <Rendition>
                    <Engine>whisper</Engine>
                    <Model>whisper_model/ggml-small.bin</Model>
                    <Modules>nv:0</Modules>
                    <InputAudioIndex>0</InputAudioIndex>
                    <OutputSubtitleLabel>Korean</OutputSubtitleLabel>
                    <SourceLanguage>auto</SourceLanguage>
                    <Translation>false</Translation>
                    <!-- Optional: sliding-window tuning -->
                    <StepMs>2000</StepMs>
                    <LengthMs>10000</LengthMs>
                    <KeepMs>1500</KeepMs>
                </Rendition>
                <!-- English STT on GPU 1 -->
                <Rendition>
                    <Engine>whisper</Engine>
                    <Model>whisper_model/ggml-small.bin</Model>
                    <Modules>nv:1</Modules>
                    <InputAudioIndex>0</InputAudioIndex>
                    <OutputSubtitleLabel>English</OutputSubtitleLabel>
                    <SourceLanguage>auto</SourceLanguage>
                    <Translation>true</Translation>
                </Rendition>
            </STT>
        </MediaOptions>
    </OutputProfiles>
</Application>

The <STT><Rendition> configuration includes the following options:

Key	Description
Engine	The STT engine to use. Currently, only `whisper` is supported.
Model	Path to the whisper.cpp model file. Can be absolute or relative to the configuration directory (where Server.xml is located).
InputAudioIndex	Index of the audio track in the input stream to transcribe. Default is `0` (first audio track).
OutputSubtitleLabel	Label of the subtitle rendition (defined in `<Subtitles>`) to write the transcription output to.
SourceLanguage	Language code of the input audio (ISO 639-1, e.g., `ko`, `en`, `ja`). Set to `auto` to enable automatic detection.
Translation	When set to `true`, translates the recognized text into English. Whisper currently supports translation to English only.
StepMs	How many milliseconds of new audio to collect before running each inference call. Default is `2000`. Lower values reduce subtitle latency but increase GPU load.
LengthMs	Total size of the audio window (in milliseconds) passed to Whisper per inference call. Default is `10000`. Larger windows give the model more context and improve accuracy.
KeepMs	Amount of audio (in milliseconds) carried over from the previous window after a context reset. Default is `1500`. Helps avoid cut-off words at window boundaries.
Modules	Selects the GPU to run this STT rendition on, using the same format as video encoder modules (e.g. `nv:0`, `nv:1`). If omitted, GPU 0 is used. Use this to distribute multiple renditions across different GPUs.

Model

Model files can be downloaded from https://huggingface.co/ggerganov/whisper.cpp. For example:

$ wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin
$ wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.bin
$ wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-medium.bin
$ wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large.bin
$ wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v2.bin

Smaller models such as ggml-small.bin provide faster inference with lower accuracy. Larger models like ggml-medium.bin or ggml-large.bin offer higher accuracy at the cost of increased GPU memory and computation time.

Runtime Control via REST API

STT can be paused and resumed at runtime without restarting the server or recreating the stream. This is useful for temporarily disabling transcription for a specific stream (e.g., during ad breaks or when the stream is not speech-heavy) to save GPU resources.

For full API reference including request/response details and error codes, see STT Control.

Endpoint	Description
`POST :enableStt`	Resume STT inference for the stream
`POST :disableStt`	Pause STT inference, dropping audio frames without GPU processing
`POST :sttStatus`	Get current enabled state and per-rendition configuration

Disabling STT at Startup

STT can be started in the disabled (paused) state by setting <Enable>false</Enable> inside the <STT> block. In this case, no GPU inference runs until the stream receives an :enableStt call.

<STT>
    <Enable>false</Enable>
    <Rendition>
        ...
    </Rendition>
</STT>

Prerequisites​

NVIDIA GPU and Driver​

CUDA Toolkit​

Build and Install whisper.cpp​

Configuration​

Step 1: Preload Models (Server.xml)​

Step 2: Define Subtitle Renditions​

Step 3: Configure STT in OutputProfiles​

Model​

Runtime Control via REST API​

Disabling STT at Startup​