Compare commits

...

3 Commits

Author SHA1 Message Date
JoshADC
58e48f804f
Merge a2c43ad8bb into a0d6cb5c15 2026-02-26 13:48:31 -07:00
Josh Hawkins
a0d6cb5c15
Docs updates (#22131)
* fix config examples

* remove reference to trt model generation script

* tweak tmpfs comment

* update old version

* tweak tmpfs comment

* clean up and clarify tensorrt

* re-add size

* Update docs/docs/configuration/hardware_acceleration_enrichments.md

Co-authored-by: Nicolas Mowen <nickmowen213@gmail.com>

---------

Co-authored-by: Nicolas Mowen <nickmowen213@gmail.com>
2026-02-26 10:57:33 -07:00
Josh Casada
a2c43ad8bb feat: ZMQ embedding runner for offloading ONNX inference to native host
Extends the ZMQ split-detector pattern (apple-silicon-detector) to cover
ONNX embedding models — ArcFace face recognition and Jina semantic search.

On macOS, Docker has no access to CoreML or the Apple Neural Engine, so
embedding inference is forced to CPU (~200ms/face for ArcFace). This adds
a ZmqEmbeddingRunner that sends preprocessed tensors to a native host
process over ZMQ TCP and receives embeddings back, enabling CoreML/ANE
acceleration outside the container.

Files changed:
- frigate/detectors/detection_runners.py: add ZmqEmbeddingRunner class
  and hook into get_optimized_runner() via "zmq://" device prefix
- tools/zmq_embedding_server.py: new host-side server script

Tested on Mac Mini M4, 24h soak test, ~5000 object reindex.
2026-02-21 12:44:42 -05:00
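The framing described in this commit can be exercised without any sockets; below is a minimal sketch of the request encode/decode step (the helper names are illustrative, not Frigate's API; the real implementations are `ZmqEmbeddingRunner` and `tools/zmq_embedding_server.py` in the diff below):

```python
import json

import numpy as np


def encode_request(tensor: np.ndarray, model_type: str) -> list[bytes]:
    """Pack a preprocessed tensor into the multipart frames [header_json, raw_bytes]."""
    header = {
        "shape": list(tensor.shape),
        "dtype": str(tensor.dtype.name),
        "model_type": model_type,
    }
    # Raw C-order bytes, matching the protocol in the commit message
    return [json.dumps(header).encode("utf-8"), np.ascontiguousarray(tensor).tobytes()]


def decode_request(frames: list[bytes]) -> tuple[np.ndarray, str]:
    """Inverse of encode_request: recover the tensor and model type on the server side."""
    header = json.loads(frames[0].decode("utf-8"))
    tensor = np.frombuffer(frames[1], dtype=np.dtype(header["dtype"])).reshape(header["shape"])
    return tensor, header["model_type"]


# Round-trip an ArcFace-shaped input (1, 3, 112, 112) float32
t = np.random.rand(1, 3, 112, 112).astype(np.float32)
t2, mt = decode_request(encode_request(t, "arcface"))
```

The same two-frame shape carries the response in the other direction, with the embedding vector in place of the input tensor.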
9 changed files with 532 additions and 59 deletions


@ -12,23 +12,20 @@ Some of Frigate's enrichments can use a discrete GPU or integrated GPU for accel
Object detection and enrichments (like Semantic Search, Face Recognition, and License Plate Recognition) are independent features. To use a GPU / NPU for object detection, see the [Object Detectors](/configuration/object_detectors.md) documentation. If you want to use your GPU for any supported enrichments, you must choose the appropriate Frigate Docker image for your GPU / NPU and configure the enrichment according to its specific documentation.
- **AMD**
- ROCm support in the `-rocm` Frigate image is automatically detected for enrichments, but only some enrichment models are available due to ROCm's focus on LLMs and limited stability with certain neural network models. Frigate disables models that perform poorly or are unstable to ensure reliable operation, so only compatible enrichments may be active.
- **Intel**
- OpenVINO will automatically be detected and used for enrichments in the default Frigate image.
- **Note:** Intel NPUs have limited model support for enrichments. GPU is recommended for enrichments when available.
- **Nvidia**
- Nvidia GPUs will automatically be detected and used for enrichments in the `-tensorrt` Frigate image.
- Jetson devices will automatically be detected and used for enrichments in the `-tensorrt-jp6` Frigate image.
- **RockChip**
- RockChip NPU will automatically be detected and used for semantic search v1 and face recognition in the `-rk` Frigate image.
Utilizing a GPU for enrichments does not require you to use the same GPU for object detection. For example, you can run the `tensorrt` Docker image to run enrichments on an Nvidia GPU and still use other dedicated hardware like a Coral or Hailo for object detection. However, one combination that is not supported is the `tensorrt` image for object detection on an Nvidia GPU and Intel iGPU for enrichments.
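For reference, enabling a supported enrichment is a one-block config change; a minimal sketch (valid `model_size` values vary per enrichment, so consult each enrichment's own page):

```yaml
semantic_search:
  enabled: true
  model_size: small
```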
:::note


@ -29,12 +29,12 @@ cameras:
When running Frigate through the HA Add-on, the Frigate `/config` directory is mapped to `/addon_configs/<addon_directory>` in the host, where `<addon_directory>` is specific to the variant of the Frigate Add-on you are running.
| Add-on Variant | Configuration directory |
| -------------------------- | ----------------------------------------- |
| Frigate | `/addon_configs/ccab4aaf_frigate` |
| Frigate (Full Access) | `/addon_configs/ccab4aaf_frigate-fa` |
| Frigate Beta | `/addon_configs/ccab4aaf_frigate-beta` |
| Frigate Beta (Full Access) | `/addon_configs/ccab4aaf_frigate-fa-beta` |
**Whenever you see `/config` in the documentation, it refers to this directory.**
@ -109,15 +109,16 @@ detectors:
record:
enabled: True
retain:
motion:
days: 7
mode: motion
alerts:
retain:
days: 30
mode: motion
detections:
retain:
days: 30
mode: motion
snapshots:
enabled: True
@ -165,15 +166,16 @@ detectors:
record:
enabled: True
retain:
motion:
days: 7
mode: motion
alerts:
retain:
days: 30
mode: motion
detections:
retain:
days: 30
mode: motion
snapshots:
enabled: True
@ -231,15 +233,16 @@ model:
record:
enabled: True
retain:
motion:
days: 7
mode: motion
alerts:
retain:
days: 30
mode: motion
detections:
retain:
days: 30
mode: motion
snapshots:
enabled: True
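The retention hunks above lose their indentation in this view; rebuilt as YAML (nesting inferred from the key order, so treat this as a sketch rather than the authoritative layout), the new-style block reads:

```yaml
record:
  enabled: True
  retain:
    motion:
      days: 7
      mode: motion
  alerts:
    retain:
      days: 30
      mode: motion
  detections:
    retain:
      days: 30
      mode: motion
snapshots:
  enabled: True
```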


@ -34,7 +34,7 @@ Frigate supports multiple different detectors that work on different types of ha
**Nvidia GPU**
- [ONNX](#onnx): Nvidia GPUs will automatically be detected and used as a detector in the `-tensorrt` Frigate image when a supported ONNX model is configured.
**Nvidia Jetson** <CommunityBadge />
@ -65,7 +65,7 @@ This does not affect using hardware for accelerating other tasks such as [semant
# Officially Supported Detectors
Frigate provides a number of built-in detector types. By default, Frigate will use a single CPU detector. Other detectors may require additional configuration as described below. When using multiple detectors, they will run in dedicated processes but pull from a common queue of detection requests from across all cameras.
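The paragraph above can be made concrete with a minimal sketch (the per-detector sections below document the real options; `onnx` here stands in for any of the built-in types):

```yaml
detectors:
  onnx:
    type: onnx
```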
## Edge TPU Detector
@ -654,11 +654,9 @@ ONNX is an open format for building machine learning models, Frigate supports ru
If the correct build is used for your GPU then the GPU will be detected and used automatically.
- **AMD**
- ROCm will automatically be detected and used with the ONNX detector in the `-rocm` Frigate image.
- **Intel**
- OpenVINO will automatically be detected and used with the ONNX detector in the default Frigate image.
- **Nvidia**


@ -41,8 +41,8 @@ If the EQ13 is out of stock, the link below may take you to a suggested alternat
| Name | Capabilities | Notes |
| ------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- | --------------------------------------------------- |
| Beelink EQ13 (<a href="https://amzn.to/4jn2qVr" target="_blank" rel="nofollow noopener sponsored">Amazon</a>) | Can run object detection on several 1080p cameras with low-medium activity | Dual gigabit NICs for easy isolated camera network. |
| Intel 1120p ([Amazon](https://www.amazon.com/Beelink-i3-1220P-Computer-Display-Gigabit/dp/B0DDCKT9YP) | Can handle a large number of 1080p cameras with high activity | |
| Intel 125H ([Amazon](https://www.amazon.com/MINISFORUM-Pro-125H-Barebone-Computer-HDMI2-1/dp/B0FH21FSZM) | Can handle a significant number of 1080p cameras with high activity | Includes NPU for more efficient detection in 0.17+ |
| Intel 1120p ([Amazon](https://www.amazon.com/Beelink-i3-1220P-Computer-Display-Gigabit/dp/B0DDCKT9YP)) | Can handle a large number of 1080p cameras with high activity | |
| Intel 125H ([Amazon](https://www.amazon.com/MINISFORUM-Pro-125H-Barebone-Computer-HDMI2-1/dp/B0FH21FSZM)) | Can handle a significant number of 1080p cameras with high activity | Includes NPU for more efficient detection in 0.17+ |
## Detectors
@ -86,7 +86,7 @@ Frigate supports multiple different detectors that work on different types of ha
**Nvidia**
- [Nvidia GPU](#nvidia-gpus): Nvidia GPUs can provide efficient object detection.
- [Supports majority of model architectures via ONNX](../../configuration/object_detectors#onnx-supported-models)
- Runs well with any size models including large
@ -172,7 +172,7 @@ Inference speeds vary greatly depending on the CPU or GPU used, some known examp
| Intel Arc A380 | ~ 6 ms | | 320: ~ 10 ms 640: ~ 22 ms | 336: 20 ms 448: 27 ms | |
| Intel Arc A750 | ~ 4 ms | | 320: ~ 8 ms | | |
### Nvidia GPUs
Frigate is able to utilize an Nvidia GPU which supports the 12.x series of CUDA libraries.
@ -182,8 +182,6 @@ Frigate is able to utilize an Nvidia GPU which supports the 12.x series of CUDA
Make sure your host system has the [nvidia-container-runtime](https://docs.docker.com/config/containers/resource_constraints/#access-an-nvidia-gpu) installed to pass through the GPU to the container and the host system has a compatible driver installed for your GPU.
#### Compatibility References:
[NVIDIA TensorRT Support Matrix](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/getting-started/support-matrix.html)
@ -192,7 +190,7 @@ There are improved capabilities in newer GPU architectures that TensorRT can ben
[NVIDIA GPU Compute Capability](https://developer.nvidia.com/cuda-gpus)
Inference is done with the `onnx` detector type. Speeds will vary greatly depending on the GPU and the model used.
`tiny (t)` variants are faster than the equivalent non-tiny model, some known examples are below:
✅ - Accelerated with CUDA Graphs


@ -56,7 +56,7 @@ services:
volumes:
- /path/to/your/config:/config
- /path/to/your/storage:/media/frigate
- type: tmpfs # 1GB In-memory filesystem for recording segment storage
target: /tmp/cache
tmpfs:
size: 1000000000
@ -123,7 +123,7 @@ On Raspberry Pi OS **Trixie**, the Hailo driver is no longer shipped with the ke
:::note
If you are **not** using a Raspberry Pi with **Bookworm OS**, skip this step and proceed directly to step 2.
If you are using a Raspberry Pi with **Trixie OS**, also skip this step and proceed directly to step 2.
:::
@ -133,13 +133,13 @@ On Raspberry Pi OS **Trixie**, the Hailo driver is no longer shipped with the ke
```bash
lsmod | grep hailo
```
If it shows `hailo_pci`, unload it:
```bash
sudo modprobe -r hailo_pci
```
Then locate the built-in kernel driver and rename it so it cannot be loaded.
Renaming allows the original driver to be restored later if needed.
First, locate the currently installed kernel module:
@ -149,28 +149,29 @@ On Raspberry Pi OS **Trixie**, the Hailo driver is no longer shipped with the ke
```
Example output:
```
/lib/modules/6.6.31+rpt-rpi-2712/kernel/drivers/media/pci/hailo/hailo_pci.ko.xz
```
Save the module path to a variable:
```bash
BUILTIN=$(modinfo -n hailo_pci)
```
Then rename the module by appending `.bak`:
```bash
sudo mv "$BUILTIN" "${BUILTIN}.bak"
```
Now refresh the kernel module map so the system recognizes the change:
```bash
sudo depmod -a
```
Reboot your Raspberry Pi:
```bash
@ -206,7 +207,6 @@ On Raspberry Pi OS **Trixie**, the Hailo driver is no longer shipped with the ke
```
The script will:
- Install necessary build dependencies
- Clone and build the Hailo driver from the official repository
- Install the driver
@ -236,18 +236,18 @@ On Raspberry Pi OS **Trixie**, the Hailo driver is no longer shipped with the ke
```
Verify the driver version:
```bash
cat /sys/module/hailo_pci/version
```
Verify that the firmware was installed correctly:
```bash
ls -l /lib/firmware/hailo/hailo8_fw.bin
```
**Optional: Fix PCIe descriptor page size error**
If you encounter the following error:
@ -462,7 +462,7 @@ services:
- /etc/localtime:/etc/localtime:ro
- /path/to/your/config:/config
- /path/to/your/storage:/media/frigate
- type: tmpfs # 1GB In-memory filesystem for recording segment storage
target: /tmp/cache
tmpfs:
size: 1000000000
@ -502,12 +502,12 @@ The official docker image tags for the current stable version are:
- `stable` - Standard Frigate build for amd64 & RPi Optimized Frigate build for arm64. This build includes support for Hailo devices as well.
- `stable-standard-arm64` - Standard Frigate build for arm64
- `stable-tensorrt` - Frigate build specific for amd64 devices running an Nvidia GPU
- `stable-rocm` - Frigate build for [AMD GPUs](../configuration/object_detectors.md#amdrocm-gpu-detector)
The community supported docker image tags for the current stable version are:
- `stable-tensorrt-jp6` - Frigate build optimized for Nvidia Jetson devices running Jetpack 6
- `stable-rk` - Frigate build for SBCs with Rockchip SoC
## Home Assistant Add-on
@ -521,7 +521,7 @@ There are important limitations in HA OS to be aware of:
- Separate local storage for media is not yet supported by Home Assistant
- AMD GPUs are not supported because HA OS does not include the mesa driver.
- Intel NPUs are not supported because HA OS does not include the NPU firmware.
- Nvidia GPUs are not supported because addons do not support the Nvidia runtime.
:::
@ -694,17 +694,18 @@ Log into QNAP, open Container Station. Frigate docker container should be listed
:::warning
macOS uses port 5000 for its AirPlay Receiver service. If you want to expose port 5000 in Frigate for local app and API access, the port will need to be mapped to another port on the host, e.g. 5001.
Failure to remap port 5000 on the host will result in the WebUI and all API endpoints on port 5000 being unreachable, even if port 5000 is exposed correctly in Docker.
:::
Docker containers on macOS can be orchestrated by either [Docker Desktop](https://docs.docker.com/desktop/setup/install/mac-install/) or [OrbStack](https://orbstack.dev) (a native Swift app). The difference in inference speeds is negligible; however, CPU usage, power consumption, and container start times will be lower on OrbStack because it is a native Swift application.
To allow Frigate to use the Apple Silicon Neural Engine / Neural Processing Unit (NPU), the host must be running the [Apple Silicon Detector](../configuration/object_detectors.md#apple-silicon-detector) outside Docker.
#### Docker Compose example
```yaml
services:
frigate:
@ -719,7 +720,7 @@ services:
ports:
- "8971:8971"
# If exposing on macOS, map to a different host port like 5001 or any other port with no conflicts
# - "5001:5000" # Internal unauthenticated access. Expose carefully.
- "8554:8554" # RTSP feeds
extra_hosts:
# This is very important


@ -20,7 +20,6 @@ Keeping Frigate up to date ensures you benefit from the latest features, perform
If you're running Frigate via Docker (recommended method), follow these steps:
1. **Stop the Container**:
- If using Docker Compose:
```bash
docker compose down frigate
@ -31,9 +30,8 @@ If youre running Frigate via Docker (recommended method), follow these steps:
```
2. **Update and Pull the Latest Image**:
- If using Docker Compose:
- Edit your `docker-compose.yml` file to specify the desired version tag (e.g., `0.17.0` instead of `0.16.4`). For example:
```yaml
services:
frigate:
@ -51,7 +49,6 @@ If youre running Frigate via Docker (recommended method), follow these steps:
```
3. **Start the Container**:
- If using Docker Compose:
```bash
docker compose up -d
@ -75,18 +72,15 @@ If youre running Frigate via Docker (recommended method), follow these steps:
For users running Frigate as a Home Assistant Addon:
1. **Check for Updates**:
- Navigate to **Settings > Add-ons** in Home Assistant.
- Find your installed Frigate addon (e.g., "Frigate NVR" or "Frigate NVR (Full Access)").
- If an update is available, you'll see an "Update" button.
2. **Update the Addon**:
- Click the "Update" button next to the Frigate addon.
- Wait for the process to complete. Home Assistant will handle downloading and installing the new version.
3. **Restart the Addon**:
- After updating, go to the addons page and click "Restart" to apply the changes.
4. **Verify the Update**:
@ -105,8 +99,8 @@ If an update causes issues:
1. Stop Frigate.
2. Restore your backed-up config file and database.
3. Revert to the previous image version:
- For Docker: Specify an older tag (e.g., `ghcr.io/blakeblackshear/frigate:0.16.4`) in your `docker run` command.
- For Docker Compose: Edit your `docker-compose.yml`, specify the older version tag (e.g., `ghcr.io/blakeblackshear/frigate:0.16.4`), and re-run `docker compose up -d`.
- For Home Assistant: Reinstall the previous addon version manually via the repository if needed and restart the addon.
4. Verify the old version is running again.
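For the Docker Compose path, the pinned-tag edit is a one-line change; a minimal sketch of the relevant fragment:

```yaml
services:
  frigate:
    image: ghcr.io/blakeblackshear/frigate:0.16.4
```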


@ -119,7 +119,7 @@ services:
volumes:
- ./config:/config
- ./storage:/media/frigate
- type: tmpfs # 1GB In-memory filesystem for recording segment storage
target: /tmp/cache
tmpfs:
size: 1000000000


@ -1,5 +1,6 @@
"""Base runner implementation for ONNX models."""
import json
import logging
import os
import platform
@ -10,6 +11,11 @@ from typing import Any
import numpy as np
import onnxruntime as ort
try:
import zmq as _zmq
except ImportError:
_zmq = None
from frigate.util.model import get_ort_providers
from frigate.util.rknn_converter import auto_convert_model, is_rknn_compatible
@ -548,12 +554,213 @@ class RKNNModelRunner(BaseModelRunner):
pass
class ZmqEmbeddingRunner(BaseModelRunner):
"""Send preprocessed embedding tensors over ZMQ to an external inference service.
This enables offloading ONNX embedding inference (e.g. ArcFace face recognition,
Jina semantic search) to a native host process that has access to hardware
acceleration unavailable inside Docker, such as CoreML/ANE on Apple Silicon.
Protocol:
- Request is a multipart message: [ header_json_bytes, tensor_bytes ]
where header is:
{
"shape": List[int], # e.g. [1, 3, 112, 112]
"dtype": str, # numpy dtype, e.g. "float32"
"model_type": str, # e.g. "arcface"
}
tensor_bytes are the raw C-order bytes of the input tensor.
- Response is either:
a) Multipart [ header_json_bytes, embedding_bytes ] with header specifying
shape and dtype of the returned embedding; or
b) Single frame of raw float32 bytes (embedding vector, batch-first).
On timeout or error, a zero embedding is returned so the caller can degrade
gracefully (the face will simply not be recognized for that frame).
Configuration example (face_recognition.device):
face_recognition:
enabled: true
model_size: large
device: "zmq://host.docker.internal:5556"
"""
# Model type → primary input name (used to answer get_input_names())
_INPUT_NAMES: dict[str, list[str]] = {}
# Model type → model input spatial width
_INPUT_WIDTHS: dict[str, int] = {}
# Model type → embedding output dimensionality (used for zero-fallback shape)
_OUTPUT_DIMS: dict[str, int] = {}
@classmethod
def _init_model_maps(cls) -> None:
"""Populate the model maps lazily to avoid circular imports at module load."""
if cls._INPUT_NAMES:
return
from frigate.embeddings.types import EnrichmentModelTypeEnum
cls._INPUT_NAMES = {
EnrichmentModelTypeEnum.arcface.value: ["data"],
EnrichmentModelTypeEnum.facenet.value: ["data"],
EnrichmentModelTypeEnum.jina_v1.value: ["pixel_values"],
EnrichmentModelTypeEnum.jina_v2.value: ["pixel_values"],
}
cls._INPUT_WIDTHS = {
EnrichmentModelTypeEnum.arcface.value: 112,
EnrichmentModelTypeEnum.facenet.value: 160,
EnrichmentModelTypeEnum.jina_v1.value: 224,
EnrichmentModelTypeEnum.jina_v2.value: 224,
}
cls._OUTPUT_DIMS = {
EnrichmentModelTypeEnum.arcface.value: 512,
EnrichmentModelTypeEnum.facenet.value: 128,
EnrichmentModelTypeEnum.jina_v1.value: 768,
EnrichmentModelTypeEnum.jina_v2.value: 768,
}
def __init__(
self,
endpoint: str,
model_type: str,
request_timeout_ms: int = 60000,
linger_ms: int = 0,
):
if _zmq is None:
raise ImportError(
"pyzmq is required for ZmqEmbeddingRunner. Install it with: pip install pyzmq"
)
self._init_model_maps()
# "zmq://host:port" is the Frigate config sentinel; ZMQ sockets need "tcp://host:port"
self._endpoint = endpoint.replace("zmq://", "tcp://", 1)
self._model_type = model_type
self._request_timeout_ms = request_timeout_ms
self._linger_ms = linger_ms
self._context = _zmq.Context()
self._socket = None
self._needs_reset = False
self._lock = threading.Lock()
self._create_socket()
logger.info(
f"ZmqEmbeddingRunner({model_type}): connected to {endpoint}"
)
def _create_socket(self) -> None:
if self._socket is not None:
try:
self._socket.close(linger=self._linger_ms)
except Exception:
pass
self._socket = self._context.socket(_zmq.REQ)
self._socket.setsockopt(_zmq.RCVTIMEO, self._request_timeout_ms)
self._socket.setsockopt(_zmq.SNDTIMEO, self._request_timeout_ms)
self._socket.setsockopt(_zmq.LINGER, self._linger_ms)
self._socket.connect(self._endpoint)
def get_input_names(self) -> list[str]:
return self._INPUT_NAMES.get(self._model_type, ["data"])
def get_input_width(self) -> int:
return self._INPUT_WIDTHS.get(self._model_type, -1)
def run(self, inputs: dict[str, Any]) -> list[np.ndarray]:
"""Send the primary input tensor over ZMQ and return the embedding.
For single-input models (ArcFace, FaceNet) the entire inputs dict maps to
one tensor. For multi-input models only the first tensor is sent; those
models are not yet supported for ZMQ offload.
"""
tensor_input = np.ascontiguousarray(next(iter(inputs.values())))
batch_size = tensor_input.shape[0]
with self._lock:
# Lazy reset: if a previous call errored, reset the socket now — before any
# ZMQ operations — so we don't manipulate sockets inside an error handler where
# Frigate's own ZMQ threads may be polling and could hit a libzmq assertion.
# The lock ensures only one thread touches the socket at a time (ZMQ REQ
# sockets are not thread-safe; concurrent calls from the reindex thread and
# the normal embedding maintainer thread would corrupt the socket state).
if self._needs_reset:
self._reset_socket()
self._needs_reset = False
try:
header = {
"shape": list(tensor_input.shape),
"dtype": str(tensor_input.dtype.name),
"model_type": self._model_type,
}
header_bytes = json.dumps(header).encode("utf-8")
payload_bytes = memoryview(tensor_input.tobytes(order="C"))
self._socket.send_multipart([header_bytes, payload_bytes])
reply_frames = self._socket.recv_multipart()
return self._decode_response(reply_frames)
except _zmq.Again:
logger.warning(
f"ZmqEmbeddingRunner({self._model_type}): request timed out, will reset socket before next call"
)
self._needs_reset = True
return [np.zeros((batch_size, self._get_output_dim()), dtype=np.float32)]
except _zmq.ZMQError as exc:
logger.error(f"ZmqEmbeddingRunner({self._model_type}) ZMQError: {exc}, will reset socket before next call")
self._needs_reset = True
return [np.zeros((batch_size, self._get_output_dim()), dtype=np.float32)]
except Exception as exc:
logger.error(f"ZmqEmbeddingRunner({self._model_type}) unexpected error: {exc}")
return [np.zeros((batch_size, self._get_output_dim()), dtype=np.float32)]
def _reset_socket(self) -> None:
try:
self._create_socket()
except Exception:
pass
def _decode_response(self, frames: list[bytes]) -> list[np.ndarray]:
try:
if len(frames) >= 2:
header = json.loads(frames[0].decode("utf-8"))
shape = tuple(header.get("shape", []))
dtype = np.dtype(header.get("dtype", "float32"))
return [np.frombuffer(frames[1], dtype=dtype).reshape(shape)]
elif len(frames) == 1:
# Raw float32 bytes — reshape to (1, embedding_dim)
arr = np.frombuffer(frames[0], dtype=np.float32)
return [arr.reshape((1, -1))]
else:
logger.warning(f"ZmqEmbeddingRunner({self._model_type}): empty reply")
return [np.zeros((1, self._get_output_dim()), dtype=np.float32)]
except Exception as exc:
logger.error(
f"ZmqEmbeddingRunner({self._model_type}): failed to decode response: {exc}"
)
return [np.zeros((1, self._get_output_dim()), dtype=np.float32)]
def _get_output_dim(self) -> int:
return self._OUTPUT_DIMS.get(self._model_type, 512)
def __del__(self) -> None:
try:
if self._socket is not None:
self._socket.close(linger=self._linger_ms)
except Exception:
pass
def get_optimized_runner(
model_path: str, device: str | None, model_type: str, **kwargs
) -> BaseModelRunner:
"""Get an optimized runner for the hardware."""
device = device or "AUTO"
# ZMQ embedding runner — offloads ONNX inference to a native host process.
# Triggered when device is a ZMQ endpoint, e.g. "zmq://host.docker.internal:5556".
if device.startswith("zmq://"):
return ZmqEmbeddingRunner(endpoint=device, model_type=model_type)
if device != "CPU" and is_rknn_compatible(model_path):
rknn_path = auto_convert_model(model_path)


@ -0,0 +1,275 @@
"""ZMQ Embedding Server — native Mac (Apple Silicon) inference service.
Runs ONNX models using hardware acceleration unavailable inside Docker on macOS,
specifically CoreML and the Apple Neural Engine. Frigate's Docker container
connects to this server over ZMQ TCP, sends preprocessed tensors, and receives
embedding vectors back.
Supported models:
- ArcFace (face recognition, 512-dim output)
- FaceNet (face recognition, 128-dim output)
- Jina V1/V2 vision (semantic search, 768-dim output)
Requirements (install outside Docker, on the Mac host):
pip install onnxruntime pyzmq numpy
Usage:
# ArcFace face recognition (port 5556):
python tools/zmq_embedding_server.py \\
--model /config/model_cache/facedet/arcface.onnx \\
--model-type arcface \\
--port 5556
# Jina V1 vision semantic search (port 5557):
python tools/zmq_embedding_server.py \\
--model /config/model_cache/jinaai/jina-clip-v1/vision_model_quantized.onnx \\
--model-type jina_v1 \\
--port 5557
Frigate config (docker-compose / config.yaml):
face_recognition:
enabled: true
model_size: large
device: "zmq://host.docker.internal:5556"
semantic_search:
enabled: true
model_size: small
device: "zmq://host.docker.internal:5557"
Protocol (REQ/REP):
Request: multipart [ header_json_bytes, tensor_bytes ]
header = {
"shape": [batch, channels, height, width], # e.g. [1, 3, 112, 112]
"dtype": "float32",
"model_type": "arcface",
}
tensor_bytes = raw C-order numpy bytes
Response: multipart [ header_json_bytes, embedding_bytes ]
header = {
"shape": [batch, embedding_dim], # e.g. [1, 512]
"dtype": "float32",
}
embedding_bytes = raw C-order numpy bytes
"""
import argparse
import json
import logging
import os
import signal
import sys
import time
import numpy as np
import zmq
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("zmq_embedding_server")
# Models that require ORT_ENABLE_BASIC optimization to avoid graph fusion issues
# (e.g. SimplifiedLayerNormFusion creates nodes that some providers can't handle).
_COMPLEX_MODELS = {"jina_v1", "jina_v2"}
# ---------------------------------------------------------------------------
# ONNX Runtime session (CoreML preferred on Apple Silicon)
# ---------------------------------------------------------------------------
def build_ort_session(model_path: str, model_type: str = ""):
"""Create an ONNX Runtime InferenceSession, preferring CoreML on macOS.
Jina V1/V2 models use ORT_ENABLE_BASIC graph optimization to avoid
fusion passes (e.g. SimplifiedLayerNormFusion) that produce unsupported
nodes. All other models use the default ORT_ENABLE_ALL.
"""
import onnxruntime as ort
available = ort.get_available_providers()
logger.info(f"Available ORT providers: {available}")
# Prefer CoreMLExecutionProvider on Apple Silicon for ANE/GPU acceleration.
# Falls back automatically to CPUExecutionProvider if CoreML is unavailable.
preferred = []
if "CoreMLExecutionProvider" in available:
preferred.append("CoreMLExecutionProvider")
logger.info("Using CoreMLExecutionProvider (Apple Neural Engine / GPU)")
else:
logger.warning(
"CoreMLExecutionProvider not available — falling back to CPU. "
"Install onnxruntime-silicon or a CoreML-enabled onnxruntime build."
)
preferred.append("CPUExecutionProvider")
sess_options = None
if model_type in _COMPLEX_MODELS:
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = (
ort.GraphOptimizationLevel.ORT_ENABLE_BASIC
)
logger.info(f"Using ORT_ENABLE_BASIC optimization for {model_type}")
session = ort.InferenceSession(model_path, sess_options=sess_options, providers=preferred)
input_names = [inp.name for inp in session.get_inputs()]
output_names = [out.name for out in session.get_outputs()]
logger.info(f"Model loaded: inputs={input_names}, outputs={output_names}")
return session
# ---------------------------------------------------------------------------
# Inference helpers
# ---------------------------------------------------------------------------
def run_arcface(session, tensor: np.ndarray) -> np.ndarray:
"""Run ArcFace — input (1, 3, 112, 112) float32, output (1, 512) float32."""
outputs = session.run(None, {"data": tensor})
return outputs[0] # shape (1, 512)
def run_generic(session, tensor: np.ndarray) -> np.ndarray:
"""Generic single-input ONNX model runner."""
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: tensor})
return outputs[0]
_RUNNERS = {
"arcface": run_arcface,
"facenet": run_generic,
"jina_v1": run_generic,
"jina_v2": run_generic,
}
# Model type → input shape for warmup inference (triggers CoreML JIT compilation
# before the first real request arrives, avoiding a ZMQ timeout on cold start).
_WARMUP_SHAPES = {
"arcface": (1, 3, 112, 112),
"facenet": (1, 3, 160, 160),
"jina_v1": (1, 3, 224, 224),
"jina_v2": (1, 3, 224, 224),
}
def warmup(session, model_type: str) -> None:
"""Run a dummy inference to trigger CoreML JIT compilation."""
shape = _WARMUP_SHAPES.get(model_type)
if shape is None:
return
logger.info(f"Warming up CoreML model ({model_type})…")
dummy = np.zeros(shape, dtype=np.float32)
try:
runner = _RUNNERS.get(model_type, run_generic)
runner(session, dummy)
logger.info("Warmup complete")
except Exception as exc:
logger.warning(f"Warmup failed (non-fatal): {exc}")
# ---------------------------------------------------------------------------
# ZMQ server loop
# ---------------------------------------------------------------------------
def serve(session, port: int, model_type: str) -> None:
context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind(f"tcp://0.0.0.0:{port}")
logger.info(f"Listening on tcp://0.0.0.0:{port} (model_type={model_type})")
runner = _RUNNERS.get(model_type, run_generic)
def _shutdown(sig, frame):
logger.info("Shutting down…")
socket.close(linger=0)
context.term()
sys.exit(0)
signal.signal(signal.SIGINT, _shutdown)
signal.signal(signal.SIGTERM, _shutdown)
while True:
try:
frames = socket.recv_multipart()
except zmq.ZMQError as exc:
logger.error(f"recv error: {exc}")
continue
if len(frames) < 2:
logger.warning(f"Received unexpected frame count: {len(frames)}, ignoring")
socket.send_multipart([b"{}"])
continue
try:
header = json.loads(frames[0].decode("utf-8"))
shape = tuple(header["shape"])
dtype = np.dtype(header.get("dtype", "float32"))
tensor = np.frombuffer(frames[1], dtype=dtype).reshape(shape)
except Exception as exc:
logger.error(f"Failed to decode request: {exc}")
socket.send_multipart([b"{}"])
continue
try:
t0 = time.monotonic()
embedding = runner(session, tensor)
elapsed_ms = (time.monotonic() - t0) * 1000
if elapsed_ms > 2000:
logger.warning(f"slow inference {elapsed_ms:.1f}ms shape={shape}")
resp_header = json.dumps(
{"shape": list(embedding.shape), "dtype": str(embedding.dtype.name)}
).encode("utf-8")
resp_payload = memoryview(np.ascontiguousarray(embedding).tobytes())
socket.send_multipart([resp_header, resp_payload])
except Exception as exc:
logger.error(f"Inference error: {exc}")
# Return a zero embedding so the client can degrade gracefully
zero = np.zeros((1, 512), dtype=np.float32)
resp_header = json.dumps(
{"shape": list(zero.shape), "dtype": "float32"}
).encode("utf-8")
socket.send_multipart([resp_header, memoryview(zero.tobytes())])
# ---------------------------------------------------------------------------
# Entry point
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(description="ZMQ Embedding Server for Frigate")
parser.add_argument(
"--model",
required=True,
help="Path to the ONNX model file (e.g. /config/model_cache/facedet/arcface.onnx)",
)
parser.add_argument(
"--model-type",
default="arcface",
choices=list(_RUNNERS.keys()),
help="Model type key (default: arcface)",
)
parser.add_argument(
"--port",
type=int,
default=5556,
help="TCP port to listen on (default: 5556)",
)
args = parser.parse_args()
if not os.path.exists(args.model):
logger.error(f"Model file not found: {args.model}")
sys.exit(1)
logger.info(f"Loading model: {args.model}")
session = build_ort_session(args.model, model_type=args.model_type)
warmup(session, model_type=args.model_type)
serve(session, port=args.port, model_type=args.model_type)
if __name__ == "__main__":
main()
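The REQ/REP cycle above can be demonstrated end-to-end without an ONNX model; this sketch uses a stand-in REP loop that returns a zero embedding in place of real inference, over `inproc://` so it runs in one process (a real deployment binds `tcp://` as the server does):

```python
import json
import threading

import numpy as np
import zmq

ENDPOINT = "inproc://embed-demo"  # in-process transport for the demo; the server binds tcp://
ctx = zmq.Context()
ready = threading.Event()


def zero_embedding_server() -> None:
    """Answer one request with a zero (batch, 512)-dim embedding, ArcFace-shaped."""
    sock = ctx.socket(zmq.REP)
    sock.bind(ENDPOINT)
    ready.set()
    frames = sock.recv_multipart()
    batch = json.loads(frames[0].decode("utf-8"))["shape"][0]
    zero = np.zeros((batch, 512), dtype=np.float32)
    header = json.dumps({"shape": list(zero.shape), "dtype": "float32"}).encode("utf-8")
    sock.send_multipart([header, zero.tobytes()])
    sock.close(linger=0)


t = threading.Thread(target=zero_embedding_server, daemon=True)
t.start()
ready.wait()

# Client side: the same framing ZmqEmbeddingRunner.run() uses
req = ctx.socket(zmq.REQ)
req.connect(ENDPOINT)
tensor = np.zeros((1, 3, 112, 112), dtype=np.float32)
header = {"shape": list(tensor.shape), "dtype": "float32", "model_type": "arcface"}
req.send_multipart([json.dumps(header).encode("utf-8"), tensor.tobytes()])
resp_header, resp_payload = req.recv_multipart()
meta = json.loads(resp_header.decode("utf-8"))
embedding = np.frombuffer(resp_payload, dtype=np.dtype(meta["dtype"])).reshape(meta["shape"])
req.close(linger=0)
t.join()
ctx.term()
```

Binding before `ready.set()` matters with `inproc://`, which (unlike `tcp://`) requires the bind to happen before any connect.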