frigate/audio_detectors.md at 3246440711a104b37c20f0cda5a74ecd59da9ae6

mirror of https://github.com/blakeblackshear/frigate.git synced 2026-05-01 19:17:41 +03:00

Josh Hawkins 3246440711 docs

2025-05-26 07:25:23 -05:00

id	title
audio_detectors	Audio Detectors

Frigate provides a built-in audio detector that runs on the CPU. Compared to object detection in images, audio detection is a relatively lightweight operation, so the only option is to run the detection on a CPU.

Configuration

Audio events work by detecting a type of audio and creating an event. The event ends once that type of audio has not been heard for the configured amount of time. Audio events save a snapshot at the beginning of the event as well as recordings throughout the event. The recordings are retained according to the configured recording retention.
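
A minimal sketch of how that timeout might be set, assuming the option is named max_not_heard and is expressed in seconds, as in Frigate's reference config. The value shown is illustrative:

```yaml
audio:
  enabled: True
  # Assumed option name: seconds without the detected audio type before the event ends
  max_not_heard: 30
```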

Enabling Audio Events

Audio events can be enabled for all cameras or only for specific cameras.


audio: # <- enable audio events for all cameras
  enabled: True

cameras:
  front_camera:
    ffmpeg:
    ...
    audio:
      enabled: True # <- enable audio events for the front_camera

If you are using multiple streams, you must set the audio role on the stream that will be used for audio detection. This can be any stream, but the stream must include audio.

:::note

The ffmpeg process for capturing audio opens a separate connection to the camera, in addition to the connections for the other roles assigned to the camera. For this reason, it is recommended to use the go2rtc restream for audio. See the restream docs for more information.

:::

cameras:
  front_camera:
    ffmpeg:
      inputs:
        - path: rtsp://.../main_stream
          roles:
            - record
        - path: rtsp://.../sub_stream # <- this stream must have audio enabled
          roles:
            - audio
            - detect

Configuring Minimum Volume

The audio detector uses volume levels in the same way that motion in a camera feed is used for object detection: to reduce resource usage, Frigate does not run audio detection unless the audio volume is above the configured level. Audio levels can vary widely between camera models, so it is important to test what volume levels your camera actually produces. MQTT Explorer can be used on the audio topic to see what volume level is being detected.

:::tip

Volume is considered motion for recordings. This means that when record -> retain -> mode is set to motion, any recording segment for that camera is kept whenever the audio volume is > min_volume.

:::
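
The following sketch shows how the minimum volume and the motion retention mode described above might fit together. The min_volume and days values are illustrative assumptions, not recommendations. Tune min_volume after observing the levels your camera reports:

```yaml
audio:
  enabled: True
  # Audio detection only runs when the volume is above this level
  min_volume: 500 # illustrative value; tune per camera

record:
  enabled: True
  retain:
    days: 7 # illustrative value
    mode: motion # segments with volume > min_volume count as motion and are kept
```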

Configuring Audio Events

The included audio model can detect over 500 different types of audio, many of which are not practical. By default bark, fire_alarm, scream, speech, and yell are enabled, but these can be customized.

audio:
  enabled: True
  listen:
    - bark
    - fire_alarm
    - scream
    - speech
    - yell
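
As a sketch of customizing the list, a camera could listen for a different set of audio types than the global default. This assumes the listen list can be overridden at the camera level in the same way as enabled. The camera name garage is hypothetical:

```yaml
audio:
  enabled: True
  listen: # global list of audio types
    - bark
    - speech

cameras:
  garage: # hypothetical camera name; ffmpeg inputs omitted
    audio:
      enabled: True
      listen: # assumed camera-level override of the global list
        - fire_alarm
```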

Audio Transcription

Frigate supports fully local audio transcription using sherpa-onnx and OpenAI's open source Whisper models (via faster-whisper). Enable audio transcription features at the global level in your config:

audio_transcription:
  enabled: True

Audio transcription can also be enabled for select cameras only at the camera level:

cameras:
  back_yard:
    ...
    audio_transcription:
      enabled: True

:::note

Audio detection must be enabled and configured as described above in order to use audio transcription features.

:::

Optional config parameters that can be set at the global level include:

- device: Device used to run transcription and translation models.
  - Default: CPU
  - This can be CPU or GPU. The sherpa-onnx models are lightweight and run on the CPU only. The whisper models can run on a GPU, but only CUDA hardware is supported.
- model_size: The size of the model used for live transcription.
  - Default: small
  - This can be small or large. The small setting uses sherpa-onnx models that are fast, lightweight, and always run on the CPU, but are not as accurate as the whisper model.
  - The large setting uses the whisper model, which is more accurate but slower, and can run on a GPU with CUDA hardware.
  - This config option applies to live transcription only. Recorded speech events always use a different whisper model (which can be accelerated on CUDA hardware, if available, with device: GPU).
- language: Defines the language used by whisper to translate speech audio events (and live audio, but only when using the large model).
  - Default: en
  - You must use a valid language code.
  - Transcriptions for speech events are translated.
  - Live audio is translated only if you are using the large model. The small sherpa-onnx model is English-only.
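
Putting the parameters above together, a global transcription config might look like the following sketch. The chosen values are illustrative and assume CUDA hardware is available:

```yaml
audio_transcription:
  enabled: True
  device: GPU # CUDA hardware only; default is CPU
  model_size: large # whisper model for live transcription; default is small
  language: en # any valid language code
```

With the defaults (device: CPU and model_size: small), live transcription uses the sherpa-onnx model and is English-only.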

Live transcription

The single camera Live view in the Frigate UI supports live transcription of audio for streams defined with the audio role.

Results can be error-prone due to a number of factors, including:

- Poor quality camera microphone
- Distance of the audio source to the camera microphone
- Low audio bitrate setting in the camera
- Background noise
- Using the small model - it's fast, but not accurate for poor quality audio

For speech sources close to the camera with minimal background noise, use the small model.

If you have CUDA hardware, you can experiment with the large whisper model on GPU. Performance is not quite as fast as the sherpa-onnx small model, but live transcription is far more accurate.

Transcription and translation of `speech` audio events

Any speech events in Explore can be transcribed and/or translated through the Transcribe button in the Tracked Object Details pane.

In order to use transcription and translation for past events, you must enable audio detection and define speech as an audio type to listen for in your config. To have speech events translated into the language of your choice, set the language config parameter with the correct language code.

The transcribed/translated speech will appear in the description box in the Tracked Object Details pane. If Semantic Search is enabled, embeddings are generated for the transcription text and are fully searchable using the description search type.

Recorded speech events will always use a whisper model, regardless of the model_size config setting. Without a GPU, generating transcriptions for longer speech events may take a fair amount of time, so be patient.
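
The requirements described in this section can be sketched as a single config: audio detection listens for speech, and transcription is enabled with a target language. The language code es is only an example:

```yaml
audio:
  enabled: True
  listen:
    - speech # required so that speech events are created

audio_transcription:
  enabled: True
  language: es # example language code used for translated speech events
```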
