Compare commits

...

3 Commits

Author SHA1 Message Date
JoshADC
58e48f804f
Merge a2c43ad8bb into a0d6cb5c15 2026-02-26 13:48:31 -07:00
Josh Hawkins
a0d6cb5c15
Docs updates (#22131)
* fix config examples

* remove reference to trt model generation script

* tweak tmpfs comment

* update old version

* tweak tmpfs comment

* clean up and clarify tensorrt

* re-add size

* Update docs/docs/configuration/hardware_acceleration_enrichments.md

Co-authored-by: Nicolas Mowen <nickmowen213@gmail.com>

---------

Co-authored-by: Nicolas Mowen <nickmowen213@gmail.com>
2026-02-26 10:57:33 -07:00
Josh Casada
a2c43ad8bb feat: ZMQ embedding runner for offloading ONNX inference to native host
Extends the ZMQ split-detector pattern (apple-silicon-detector) to cover
ONNX embedding models — ArcFace face recognition and Jina semantic search.

On macOS, Docker has no access to CoreML or the Apple Neural Engine, so
embedding inference is forced to CPU (~200ms/face for ArcFace). This adds
a ZmqEmbeddingRunner that sends preprocessed tensors to a native host
process over ZMQ TCP and receives embeddings back, enabling CoreML/ANE
acceleration outside the container.

Files changed:
- frigate/detectors/detection_runners.py: add ZmqEmbeddingRunner class
  and hook into get_optimized_runner() via "zmq://" device prefix
- tools/zmq_embedding_server.py: new host-side server script

Tested on Mac Mini M4, 24h soak test, ~5000 object reindex.
2026-02-21 12:44:42 -05:00
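The framing described in this commit can be exercised without any sockets; below is a minimal sketch of the request encode/decode step (the helper names are illustrative, not Frigate's API; the real implementations are `ZmqEmbeddingRunner` and `tools/zmq_embedding_server.py` in the diff below):

```python
import json

import numpy as np


def encode_request(tensor: np.ndarray, model_type: str) -> list[bytes]:
    """Pack a preprocessed tensor into the multipart frames [header_json, raw_bytes]."""
    header = {
        "shape": list(tensor.shape),
        "dtype": str(tensor.dtype.name),
        "model_type": model_type,
    }
    # Raw C-order bytes, matching the protocol in the commit message
    return [json.dumps(header).encode("utf-8"), np.ascontiguousarray(tensor).tobytes()]


def decode_request(frames: list[bytes]) -> tuple[np.ndarray, str]:
    """Inverse of encode_request: recover the tensor and model type on the server side."""
    header = json.loads(frames[0].decode("utf-8"))
    tensor = np.frombuffer(frames[1], dtype=np.dtype(header["dtype"])).reshape(header["shape"])
    return tensor, header["model_type"]


# Round-trip an ArcFace-shaped input (1, 3, 112, 112) float32
t = np.random.rand(1, 3, 112, 112).astype(np.float32)
t2, mt = decode_request(encode_request(t, "arcface"))
```

The same two-frame shape carries the response in the other direction, with the embedding vector in place of the input tensor.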
9 changed files with 532 additions and 59 deletions


@ -12,23 +12,20 @@ Some of Frigate's enrichments can use a discrete GPU or integrated GPU for accel
Object detection and enrichments (like Semantic Search, Face Recognition, and License Plate Recognition) are independent features. To use a GPU / NPU for object detection, see the [Object Detectors](/configuration/object_detectors.md) documentation. If you want to use your GPU for any supported enrichments, you must choose the appropriate Frigate Docker image for your GPU / NPU and configure the enrichment according to its specific documentation.
- **AMD**
- ROCm support in the `-rocm` Frigate image is automatically detected for enrichments, but only some enrichment models are available due to ROCm's focus on LLMs and limited stability with certain neural network models. Frigate disables models that perform poorly or are unstable to ensure reliable operation, so only compatible enrichments may be active.
- **Intel**
- OpenVINO will automatically be detected and used for enrichments in the default Frigate image.
- **Note:** Intel NPUs have limited model support for enrichments. GPU is recommended for enrichments when available.
- **Nvidia**
- Nvidia GPUs will automatically be detected and used for enrichments in the `-tensorrt` Frigate image.
- Jetson devices will automatically be detected and used for enrichments in the `-tensorrt-jp6` Frigate image.
- **RockChip**
- RockChip NPU will automatically be detected and used for semantic search v1 and face recognition in the `-rk` Frigate image.
Utilizing a GPU for enrichments does not require you to use the same GPU for object detection. For example, you can run the `tensorrt` Docker image to run enrichments on an Nvidia GPU and still use other dedicated hardware like a Coral or Hailo for object detection. However, one combination that is not supported is the `tensorrt` image for object detection on an Nvidia GPU and Intel iGPU for enrichments.
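For reference, enabling a supported enrichment is a one-block config change; a minimal sketch (valid `model_size` values vary per enrichment, so consult each enrichment's own page):

```yaml
semantic_search:
  enabled: true
  model_size: small
```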
:::note


@ -29,12 +29,12 @@ cameras:
When running Frigate through the HA Add-on, the Frigate `/config` directory is mapped to `/addon_configs/<addon_directory>` in the host, where `<addon_directory>` is specific to the variant of the Frigate Add-on you are running.
| Add-on Variant | Configuration directory |
| -------------------------- | ----------------------------------------- |
| Frigate | `/addon_configs/ccab4aaf_frigate` |
| Frigate (Full Access) | `/addon_configs/ccab4aaf_frigate-fa` |
| Frigate Beta | `/addon_configs/ccab4aaf_frigate-beta` |
| Frigate Beta (Full Access) | `/addon_configs/ccab4aaf_frigate-fa-beta` |
**Whenever you see `/config` in the documentation, it refers to this directory.**
@ -109,15 +109,16 @@ detectors:
record:
enabled: True
retain:
motion:
days: 7
mode: motion
alerts:
retain:
days: 30
mode: motion
detections:
retain:
days: 30
mode: motion
snapshots:
enabled: True
@ -165,15 +166,16 @@ detectors:
record:
enabled: True
retain:
motion:
days: 7
mode: motion
alerts:
retain:
days: 30
mode: motion
detections:
retain:
days: 30
mode: motion
snapshots:
enabled: True
@ -231,15 +233,16 @@ model:
record:
enabled: True
retain:
motion:
days: 7
mode: motion
alerts:
retain:
days: 30
mode: motion
detections:
retain:
days: 30
mode: motion
snapshots:
enabled: True
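The retention hunks above lose their indentation in this view; rebuilt as YAML (nesting inferred from the key order, so treat this as a sketch rather than the authoritative layout), the new-style block reads:

```yaml
record:
  enabled: True
  retain:
    motion:
      days: 7
      mode: motion
  alerts:
    retain:
      days: 30
      mode: motion
  detections:
    retain:
      days: 30
      mode: motion
snapshots:
  enabled: True
```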


@ -34,7 +34,7 @@ Frigate supports multiple different detectors that work on different types of ha
**Nvidia GPU**
- [ONNX](#onnx): Nvidia GPUs will automatically be detected and used as a detector in the `-tensorrt` Frigate image when a supported ONNX model is configured.
**Nvidia Jetson** <CommunityBadge />
@ -65,7 +65,7 @@ This does not affect using hardware for accelerating other tasks such as [semant
# Officially Supported Detectors
Frigate provides a number of built-in detector types. By default, Frigate will use a single CPU detector. Other detectors may require additional configuration as described below. When using multiple detectors, they will run in dedicated processes but pull from a common queue of detection requests from across all cameras.
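The paragraph above can be made concrete with a minimal sketch (the per-detector sections below document the real options; `onnx` here stands in for any of the built-in types):

```yaml
detectors:
  onnx:
    type: onnx
```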
## Edge TPU Detector
@ -654,11 +654,9 @@ ONNX is an open format for building machine learning models, Frigate supports ru
If the correct build is used for your GPU then the GPU will be detected and used automatically.
- **AMD**
- ROCm will automatically be detected and used with the ONNX detector in the `-rocm` Frigate image.
- **Intel**
- OpenVINO will automatically be detected and used with the ONNX detector in the default Frigate image.
- **Nvidia**


@ -41,8 +41,8 @@ If the EQ13 is out of stock, the link below may take you to a suggested alternat
| Name | Capabilities | Notes |
| ------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- | --------------------------------------------------- |
| Beelink EQ13 (<a href="https://amzn.to/4jn2qVr" target="_blank" rel="nofollow noopener sponsored">Amazon</a>) | Can run object detection on several 1080p cameras with low-medium activity | Dual gigabit NICs for easy isolated camera network. |
| Intel 1120p ([Amazon](https://www.amazon.com/Beelink-i3-1220P-Computer-Display-Gigabit/dp/B0DDCKT9YP) | Can handle a large number of 1080p cameras with high activity | |
| Intel 125H ([Amazon](https://www.amazon.com/MINISFORUM-Pro-125H-Barebone-Computer-HDMI2-1/dp/B0FH21FSZM) | Can handle a significant number of 1080p cameras with high activity | Includes NPU for more efficient detection in 0.17+ |
| Intel 1120p ([Amazon](https://www.amazon.com/Beelink-i3-1220P-Computer-Display-Gigabit/dp/B0DDCKT9YP)) | Can handle a large number of 1080p cameras with high activity | |
| Intel 125H ([Amazon](https://www.amazon.com/MINISFORUM-Pro-125H-Barebone-Computer-HDMI2-1/dp/B0FH21FSZM)) | Can handle a significant number of 1080p cameras with high activity | Includes NPU for more efficient detection in 0.17+ |
## Detectors
@ -86,7 +86,7 @@ Frigate supports multiple different detectors that work on different types of ha
**Nvidia**
- [Nvidia GPU](#nvidia-gpus): Nvidia GPUs can provide efficient object detection.
- [Supports majority of model architectures via ONNX](../../configuration/object_detectors#onnx-supported-models)
- Runs well with any size models including large
@ -172,7 +172,7 @@ Inference speeds vary greatly depending on the CPU or GPU used, some known examp
| Intel Arc A380 | ~ 6 ms | | 320: ~ 10 ms 640: ~ 22 ms | 336: 20 ms 448: 27 ms | |
| Intel Arc A750 | ~ 4 ms | | 320: ~ 8 ms | | |
### Nvidia GPUs
Frigate is able to utilize an Nvidia GPU which supports the 12.x series of CUDA libraries.
@ -182,8 +182,6 @@ Frigate is able to utilize an Nvidia GPU which supports the 12.x series of CUDA
Make sure your host system has the [nvidia-container-runtime](https://docs.docker.com/config/containers/resource_constraints/#access-an-nvidia-gpu) installed to pass through the GPU to the container and the host system has a compatible driver installed for your GPU.
#### Compatibility References:
[NVIDIA TensorRT Support Matrix](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/getting-started/support-matrix.html)
@ -192,7 +190,7 @@ There are improved capabilities in newer GPU architectures that TensorRT can ben
[NVIDIA GPU Compute Capability](https://developer.nvidia.com/cuda-gpus)
Inference is done with the `onnx` detector type. Speeds will vary greatly depending on the GPU and the model used.
`tiny (t)` variants are faster than the equivalent non-tiny model, some known examples are below:
✅ - Accelerated with CUDA Graphs


@ -56,7 +56,7 @@ services:
volumes:
- /path/to/your/config:/config
- /path/to/your/storage:/media/frigate
- type: tmpfs # 1GB In-memory filesystem for recording segment storage
target: /tmp/cache
tmpfs:
size: 1000000000
@ -123,7 +123,7 @@ On Raspberry Pi OS **Trixie**, the Hailo driver is no longer shipped with the ke
:::note
If you are **not** using a Raspberry Pi with **Bookworm OS**, skip this step and proceed directly to step 2.
If you are using a Raspberry Pi with **Trixie OS**, also skip this step and proceed directly to step 2.
:::
@ -133,13 +133,13 @@ On Raspberry Pi OS **Trixie**, the Hailo driver is no longer shipped with the ke
```bash
lsmod | grep hailo
```
If it shows `hailo_pci`, unload it:
```bash
sudo modprobe -r hailo_pci
```
Then locate the built-in kernel driver and rename it so it cannot be loaded.
Renaming allows the original driver to be restored later if needed.
First, locate the currently installed kernel module:
@ -149,28 +149,29 @@ On Raspberry Pi OS **Trixie**, the Hailo driver is no longer shipped with the ke
```
Example output:
```
/lib/modules/6.6.31+rpt-rpi-2712/kernel/drivers/media/pci/hailo/hailo_pci.ko.xz
```
Save the module path to a variable:
```bash
BUILTIN=$(modinfo -n hailo_pci)
```
Then rename the module by appending `.bak`:
```bash
sudo mv "$BUILTIN" "${BUILTIN}.bak"
```
Now refresh the kernel module map so the system recognizes the change:
```bash
sudo depmod -a
```
Reboot your Raspberry Pi:
```bash
@ -206,7 +207,6 @@ On Raspberry Pi OS **Trixie**, the Hailo driver is no longer shipped with the ke
```
The script will:
- Install necessary build dependencies
- Clone and build the Hailo driver from the official repository
- Install the driver
@ -236,18 +236,18 @@ On Raspberry Pi OS **Trixie**, the Hailo driver is no longer shipped with the ke
```
Verify the driver version:
```bash
cat /sys/module/hailo_pci/version
```
Verify that the firmware was installed correctly:
```bash
ls -l /lib/firmware/hailo/hailo8_fw.bin
```
**Optional: Fix PCIe descriptor page size error**
If you encounter the following error:
@ -462,7 +462,7 @@ services:
- /etc/localtime:/etc/localtime:ro
- /path/to/your/config:/config
- /path/to/your/storage:/media/frigate
- type: tmpfs # 1GB In-memory filesystem for recording segment storage
target: /tmp/cache
tmpfs:
size: 1000000000
@ -502,12 +502,12 @@ The official docker image tags for the current stable version are:
- `stable` - Standard Frigate build for amd64 & RPi Optimized Frigate build for arm64. This build includes support for Hailo devices as well.
- `stable-standard-arm64` - Standard Frigate build for arm64
- `stable-tensorrt` - Frigate build specific for amd64 devices running an Nvidia GPU
- `stable-rocm` - Frigate build for [AMD GPUs](../configuration/object_detectors.md#amdrocm-gpu-detector)
The community supported docker image tags for the current stable version are:
- `stable-tensorrt-jp6` - Frigate build optimized for Nvidia Jetson devices running Jetpack 6
- `stable-rk` - Frigate build for SBCs with Rockchip SoC
## Home Assistant Add-on
@ -521,7 +521,7 @@ There are important limitations in HA OS to be aware of:
- Separate local storage for media is not yet supported by Home Assistant
- AMD GPUs are not supported because HA OS does not include the mesa driver.
- Intel NPUs are not supported because HA OS does not include the NPU firmware.
- Nvidia GPUs are not supported because addons do not support the Nvidia runtime.
:::
@ -694,17 +694,18 @@ Log into QNAP, open Container Station. Frigate docker container should be listed
:::warning
macOS uses port 5000 for its AirPlay Receiver service. If you want to expose port 5000 in Frigate for local app and API access, the port will need to be mapped to another port on the host, e.g. 5001.
Failure to remap port 5000 on the host will result in the WebUI and all API endpoints on port 5000 being unreachable, even if port 5000 is exposed correctly in Docker.
:::
Docker containers on macOS can be orchestrated by either [Docker Desktop](https://docs.docker.com/desktop/setup/install/mac-install/) or [OrbStack](https://orbstack.dev) (a native Swift app). The difference in inference speeds is negligible; however, CPU usage, power consumption, and container start times will be lower on OrbStack because it is a native Swift application.
To allow Frigate to use the Apple Silicon Neural Engine / Neural Processing Unit (NPU), the host must be running the [Apple Silicon Detector](../configuration/object_detectors.md#apple-silicon-detector) outside Docker.
#### Docker Compose example
```yaml
services:
frigate:
@ -719,7 +720,7 @@ services:
ports:
- "8971:8971"
# If exposing on macOS, map to a different host port like 5001 or any other port with no conflicts
# - "5001:5000" # Internal unauthenticated access. Expose carefully.
- "8554:8554" # RTSP feeds
extra_hosts:
# This is very important


@ -20,7 +20,6 @@ Keeping Frigate up to date ensures you benefit from the latest features, perform
If you're running Frigate via Docker (recommended method), follow these steps:
1. **Stop the Container**:
- If using Docker Compose:
```bash
docker compose down frigate
@ -31,9 +30,8 @@ If youre running Frigate via Docker (recommended method), follow these steps:
```
2. **Update and Pull the Latest Image**:
- If using Docker Compose:
- Edit your `docker-compose.yml` file to specify the desired version tag (e.g., `0.17.0` instead of `0.16.4`). For example:
```yaml
services:
frigate:
@ -51,7 +49,6 @@ If youre running Frigate via Docker (recommended method), follow these steps:
```
3. **Start the Container**:
- If using Docker Compose:
```bash
docker compose up -d
@ -75,18 +72,15 @@ If youre running Frigate via Docker (recommended method), follow these steps:
For users running Frigate as a Home Assistant Addon:
1. **Check for Updates**:
- Navigate to **Settings > Add-ons** in Home Assistant.
- Find your installed Frigate addon (e.g., "Frigate NVR" or "Frigate NVR (Full Access)").
- If an update is available, you'll see an "Update" button.
2. **Update the Addon**:
- Click the "Update" button next to the Frigate addon.
- Wait for the process to complete. Home Assistant will handle downloading and installing the new version.
3. **Restart the Addon**:
- After updating, go to the addons page and click "Restart" to apply the changes.
4. **Verify the Update**:
@ -105,8 +99,8 @@ If an update causes issues:
1. Stop Frigate.
2. Restore your backed-up config file and database.
3. Revert to the previous image version:
- For Docker: Specify an older tag (e.g., `ghcr.io/blakeblackshear/frigate:0.16.4`) in your `docker run` command.
- For Docker Compose: Edit your `docker-compose.yml`, specify the older version tag (e.g., `ghcr.io/blakeblackshear/frigate:0.16.4`), and re-run `docker compose up -d`.
- For Home Assistant: Reinstall the previous addon version manually via the repository if needed and restart the addon.
4. Verify the old version is running again.
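For the Docker Compose path, the pinned-tag edit is a one-line change; a minimal sketch of the relevant fragment:

```yaml
services:
  frigate:
    image: ghcr.io/blakeblackshear/frigate:0.16.4
```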


@ -119,7 +119,7 @@ services:
volumes:
- ./config:/config
- ./storage:/media/frigate
- type: tmpfs # 1GB In-memory filesystem for recording segment storage
target: /tmp/cache
tmpfs:
size: 1000000000


@ -1,5 +1,6 @@
"""Base runner implementation for ONNX models."""
import json
import logging
import os
import platform
@ -10,6 +11,11 @@ from typing import Any
import numpy as np
import onnxruntime as ort
try:
import zmq as _zmq
except ImportError:
_zmq = None
from frigate.util.model import get_ort_providers
from frigate.util.rknn_converter import auto_convert_model, is_rknn_compatible
@ -548,12 +554,213 @@ class RKNNModelRunner(BaseModelRunner):
pass
class ZmqEmbeddingRunner(BaseModelRunner):
"""Send preprocessed embedding tensors over ZMQ to an external inference service.
This enables offloading ONNX embedding inference (e.g. ArcFace face recognition,
Jina semantic search) to a native host process that has access to hardware
acceleration unavailable inside Docker, such as CoreML/ANE on Apple Silicon.
Protocol:
- Request is a multipart message: [ header_json_bytes, tensor_bytes ]
where header is:
{
"shape": List[int], # e.g. [1, 3, 112, 112]
"dtype": str, # numpy dtype, e.g. "float32"
"model_type": str, # e.g. "arcface"
}
tensor_bytes are the raw C-order bytes of the input tensor.
- Response is either:
a) Multipart [ header_json_bytes, embedding_bytes ] with header specifying
shape and dtype of the returned embedding; or
b) Single frame of raw float32 bytes (embedding vector, batch-first).
On timeout or error, a zero embedding is returned so the caller can degrade
gracefully (the face will simply not be recognized for that frame).
Configuration example (face_recognition.device):
face_recognition:
enabled: true
model_size: large
device: "zmq://host.docker.internal:5556"
"""
# Model type → primary input name (used to answer get_input_names())
_INPUT_NAMES: dict[str, list[str]] = {}
# Model type → model input spatial width
_INPUT_WIDTHS: dict[str, int] = {}
# Model type → embedding output dimensionality (used for zero-fallback shape)
_OUTPUT_DIMS: dict[str, int] = {}
@classmethod
def _init_model_maps(cls) -> None:
"""Populate the model maps lazily to avoid circular imports at module load."""
if cls._INPUT_NAMES:
return
from frigate.embeddings.types import EnrichmentModelTypeEnum
cls._INPUT_NAMES = {
EnrichmentModelTypeEnum.arcface.value: ["data"],
EnrichmentModelTypeEnum.facenet.value: ["data"],
EnrichmentModelTypeEnum.jina_v1.value: ["pixel_values"],
EnrichmentModelTypeEnum.jina_v2.value: ["pixel_values"],
}
cls._INPUT_WIDTHS = {
EnrichmentModelTypeEnum.arcface.value: 112,
EnrichmentModelTypeEnum.facenet.value: 160,
EnrichmentModelTypeEnum.jina_v1.value: 224,
EnrichmentModelTypeEnum.jina_v2.value: 224,
}
cls._OUTPUT_DIMS = {
EnrichmentModelTypeEnum.arcface.value: 512,
EnrichmentModelTypeEnum.facenet.value: 128,
EnrichmentModelTypeEnum.jina_v1.value: 768,
EnrichmentModelTypeEnum.jina_v2.value: 768,
}
def __init__(
self,
endpoint: str,
model_type: str,
request_timeout_ms: int = 60000,
linger_ms: int = 0,
):
if _zmq is None:
raise ImportError(
"pyzmq is required for ZmqEmbeddingRunner. Install it with: pip install pyzmq"
)
self._init_model_maps()
# "zmq://host:port" is the Frigate config sentinel; ZMQ sockets need "tcp://host:port"
self._endpoint = endpoint.replace("zmq://", "tcp://", 1)
self._model_type = model_type
self._request_timeout_ms = request_timeout_ms
self._linger_ms = linger_ms
self._context = _zmq.Context()
self._socket = None
self._needs_reset = False
self._lock = threading.Lock()
self._create_socket()
logger.info(
f"ZmqEmbeddingRunner({model_type}): connected to {endpoint}"
)
def _create_socket(self) -> None:
if self._socket is not None:
try:
self._socket.close(linger=self._linger_ms)
except Exception:
pass
self._socket = self._context.socket(_zmq.REQ)
self._socket.setsockopt(_zmq.RCVTIMEO, self._request_timeout_ms)
self._socket.setsockopt(_zmq.SNDTIMEO, self._request_timeout_ms)
self._socket.setsockopt(_zmq.LINGER, self._linger_ms)
self._socket.connect(self._endpoint)
def get_input_names(self) -> list[str]:
return self._INPUT_NAMES.get(self._model_type, ["data"])
def get_input_width(self) -> int:
return self._INPUT_WIDTHS.get(self._model_type, -1)
def run(self, inputs: dict[str, Any]) -> list[np.ndarray]:
"""Send the primary input tensor over ZMQ and return the embedding.
For single-input models (ArcFace, FaceNet) the entire inputs dict maps to
one tensor. For multi-input models only the first tensor is sent; those
models are not yet supported for ZMQ offload.
"""
tensor_input = np.ascontiguousarray(next(iter(inputs.values())))
batch_size = tensor_input.shape[0]
with self._lock:
# Lazy reset: if a previous call errored, reset the socket now — before any
# ZMQ operations — so we don't manipulate sockets inside an error handler where
# Frigate's own ZMQ threads may be polling and could hit a libzmq assertion.
# The lock ensures only one thread touches the socket at a time (ZMQ REQ
# sockets are not thread-safe; concurrent calls from the reindex thread and
# the normal embedding maintainer thread would corrupt the socket state).
if self._needs_reset:
self._reset_socket()
self._needs_reset = False
try:
header = {
"shape": list(tensor_input.shape),
"dtype": str(tensor_input.dtype.name),
"model_type": self._model_type,
}
header_bytes = json.dumps(header).encode("utf-8")
payload_bytes = memoryview(tensor_input.tobytes(order="C"))
self._socket.send_multipart([header_bytes, payload_bytes])
reply_frames = self._socket.recv_multipart()
return self._decode_response(reply_frames)
except _zmq.Again:
logger.warning(
f"ZmqEmbeddingRunner({self._model_type}): request timed out, will reset socket before next call"
)
self._needs_reset = True
return [np.zeros((batch_size, self._get_output_dim()), dtype=np.float32)]
except _zmq.ZMQError as exc:
logger.error(f"ZmqEmbeddingRunner({self._model_type}) ZMQError: {exc}, will reset socket before next call")
self._needs_reset = True
return [np.zeros((batch_size, self._get_output_dim()), dtype=np.float32)]
except Exception as exc:
logger.error(f"ZmqEmbeddingRunner({self._model_type}) unexpected error: {exc}")
return [np.zeros((batch_size, self._get_output_dim()), dtype=np.float32)]
def _reset_socket(self) -> None:
try:
self._create_socket()
except Exception:
pass
def _decode_response(self, frames: list[bytes]) -> list[np.ndarray]:
try:
if len(frames) >= 2:
header = json.loads(frames[0].decode("utf-8"))
shape = tuple(header.get("shape", []))
dtype = np.dtype(header.get("dtype", "float32"))
return [np.frombuffer(frames[1], dtype=dtype).reshape(shape)]
elif len(frames) == 1:
# Raw float32 bytes — reshape to (1, embedding_dim)
arr = np.frombuffer(frames[0], dtype=np.float32)
return [arr.reshape((1, -1))]
else:
logger.warning(f"ZmqEmbeddingRunner({self._model_type}): empty reply")
return [np.zeros((1, self._get_output_dim()), dtype=np.float32)]
except Exception as exc:
logger.error(
f"ZmqEmbeddingRunner({self._model_type}): failed to decode response: {exc}"
)
return [np.zeros((1, self._get_output_dim()), dtype=np.float32)]
def _get_output_dim(self) -> int:
return self._OUTPUT_DIMS.get(self._model_type, 512)
def __del__(self) -> None:
try:
if self._socket is not None:
self._socket.close(linger=self._linger_ms)
except Exception:
pass
def get_optimized_runner(
model_path: str, device: str | None, model_type: str, **kwargs
) -> BaseModelRunner:
"""Get an optimized runner for the hardware."""
device = device or "AUTO"
# ZMQ embedding runner — offloads ONNX inference to a native host process.
# Triggered when device is a ZMQ endpoint, e.g. "zmq://host.docker.internal:5556".
if device.startswith("zmq://"):
return ZmqEmbeddingRunner(endpoint=device, model_type=model_type)
if device != "CPU" and is_rknn_compatible(model_path):
rknn_path = auto_convert_model(model_path)


@ -0,0 +1,275 @@
"""ZMQ Embedding Server — native Mac (Apple Silicon) inference service.
Runs ONNX models using hardware acceleration unavailable inside Docker on macOS,
specifically CoreML and the Apple Neural Engine. Frigate's Docker container
connects to this server over ZMQ TCP, sends preprocessed tensors, and receives
embedding vectors back.
Supported models:
- ArcFace (face recognition, 512-dim output)
- FaceNet (face recognition, 128-dim output)
- Jina V1/V2 vision (semantic search, 768-dim output)
Requirements (install outside Docker, on the Mac host):
pip install onnxruntime pyzmq numpy
Usage:
# ArcFace face recognition (port 5556):
python tools/zmq_embedding_server.py \\
--model /config/model_cache/facedet/arcface.onnx \\
--model-type arcface \\
--port 5556
# Jina V1 vision semantic search (port 5557):
python tools/zmq_embedding_server.py \\
--model /config/model_cache/jinaai/jina-clip-v1/vision_model_quantized.onnx \\
--model-type jina_v1 \\
--port 5557
Frigate config (docker-compose / config.yaml):
face_recognition:
enabled: true
model_size: large
device: "zmq://host.docker.internal:5556"
semantic_search:
enabled: true
model_size: small
device: "zmq://host.docker.internal:5557"
Protocol (REQ/REP):
Request: multipart [ header_json_bytes, tensor_bytes ]
header = {
"shape": [batch, channels, height, width], # e.g. [1, 3, 112, 112]
"dtype": "float32",
"model_type": "arcface",
}
tensor_bytes = raw C-order numpy bytes
Response: multipart [ header_json_bytes, embedding_bytes ]
header = {
"shape": [batch, embedding_dim], # e.g. [1, 512]
"dtype": "float32",
}
embedding_bytes = raw C-order numpy bytes
"""
import argparse
import json
import logging
import os
import signal
import sys
import time
import numpy as np
import zmq
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("zmq_embedding_server")
# Models that require ORT_ENABLE_BASIC optimization to avoid graph fusion issues
# (e.g. SimplifiedLayerNormFusion creates nodes that some providers can't handle).
_COMPLEX_MODELS = {"jina_v1", "jina_v2"}
# ---------------------------------------------------------------------------
# ONNX Runtime session (CoreML preferred on Apple Silicon)
# ---------------------------------------------------------------------------
def build_ort_session(model_path: str, model_type: str = ""):
"""Create an ONNX Runtime InferenceSession, preferring CoreML on macOS.
Jina V1/V2 models use ORT_ENABLE_BASIC graph optimization to avoid
fusion passes (e.g. SimplifiedLayerNormFusion) that produce unsupported
nodes. All other models use the default ORT_ENABLE_ALL.
"""
import onnxruntime as ort
available = ort.get_available_providers()
logger.info(f"Available ORT providers: {available}")
# Prefer CoreMLExecutionProvider on Apple Silicon for ANE/GPU acceleration.
# Falls back automatically to CPUExecutionProvider if CoreML is unavailable.
preferred = []
if "CoreMLExecutionProvider" in available:
preferred.append("CoreMLExecutionProvider")
logger.info("Using CoreMLExecutionProvider (Apple Neural Engine / GPU)")
else:
logger.warning(
"CoreMLExecutionProvider not available — falling back to CPU. "
"Install onnxruntime-silicon or a CoreML-enabled onnxruntime build."
)
preferred.append("CPUExecutionProvider")
sess_options = None
if model_type in _COMPLEX_MODELS:
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = (
ort.GraphOptimizationLevel.ORT_ENABLE_BASIC
)
logger.info(f"Using ORT_ENABLE_BASIC optimization for {model_type}")
session = ort.InferenceSession(model_path, sess_options=sess_options, providers=preferred)
input_names = [inp.name for inp in session.get_inputs()]
output_names = [out.name for out in session.get_outputs()]
logger.info(f"Model loaded: inputs={input_names}, outputs={output_names}")
return session
# ---------------------------------------------------------------------------
# Inference helpers
# ---------------------------------------------------------------------------
def run_arcface(session, tensor: np.ndarray) -> np.ndarray:
"""Run ArcFace — input (1, 3, 112, 112) float32, output (1, 512) float32."""
outputs = session.run(None, {"data": tensor})
return outputs[0] # shape (1, 512)
def run_generic(session, tensor: np.ndarray) -> np.ndarray:
"""Generic single-input ONNX model runner."""
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: tensor})
return outputs[0]
_RUNNERS = {
"arcface": run_arcface,
"facenet": run_generic,
"jina_v1": run_generic,
"jina_v2": run_generic,
}
# Model type → input shape for warmup inference (triggers CoreML JIT compilation
# before the first real request arrives, avoiding a ZMQ timeout on cold start).
_WARMUP_SHAPES = {
"arcface": (1, 3, 112, 112),
"facenet": (1, 3, 160, 160),
"jina_v1": (1, 3, 224, 224),
"jina_v2": (1, 3, 224, 224),
}
def warmup(session, model_type: str) -> None:
"""Run a dummy inference to trigger CoreML JIT compilation."""
shape = _WARMUP_SHAPES.get(model_type)
if shape is None:
return
logger.info(f"Warming up CoreML model ({model_type})…")
dummy = np.zeros(shape, dtype=np.float32)
try:
runner = _RUNNERS.get(model_type, run_generic)
runner(session, dummy)
logger.info("Warmup complete")
except Exception as exc:
logger.warning(f"Warmup failed (non-fatal): {exc}")
# ---------------------------------------------------------------------------
# ZMQ server loop
# ---------------------------------------------------------------------------
def serve(session, port: int, model_type: str) -> None:
context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind(f"tcp://0.0.0.0:{port}")
logger.info(f"Listening on tcp://0.0.0.0:{port} (model_type={model_type})")
runner = _RUNNERS.get(model_type, run_generic)
def _shutdown(sig, frame):
logger.info("Shutting down…")
socket.close(linger=0)
context.term()
sys.exit(0)
signal.signal(signal.SIGINT, _shutdown)
signal.signal(signal.SIGTERM, _shutdown)
while True:
try:
frames = socket.recv_multipart()
except zmq.ZMQError as exc:
logger.error(f"recv error: {exc}")
continue
if len(frames) < 2:
logger.warning(f"Received unexpected frame count: {len(frames)}, ignoring")
socket.send_multipart([b"{}"])
continue
try:
header = json.loads(frames[0].decode("utf-8"))
shape = tuple(header["shape"])
dtype = np.dtype(header.get("dtype", "float32"))
tensor = np.frombuffer(frames[1], dtype=dtype).reshape(shape)
except Exception as exc:
logger.error(f"Failed to decode request: {exc}")
socket.send_multipart([b"{}"])
continue
try:
t0 = time.monotonic()
embedding = runner(session, tensor)
elapsed_ms = (time.monotonic() - t0) * 1000
if elapsed_ms > 2000:
logger.warning(f"slow inference {elapsed_ms:.1f}ms shape={shape}")
resp_header = json.dumps(
{"shape": list(embedding.shape), "dtype": str(embedding.dtype.name)}
).encode("utf-8")
resp_payload = memoryview(np.ascontiguousarray(embedding).tobytes())
socket.send_multipart([resp_header, resp_payload])
except Exception as exc:
logger.error(f"Inference error: {exc}")
# Return a zero embedding so the client can degrade gracefully
zero = np.zeros((1, 512), dtype=np.float32)
resp_header = json.dumps(
{"shape": list(zero.shape), "dtype": "float32"}
).encode("utf-8")
socket.send_multipart([resp_header, memoryview(zero.tobytes())])
# ---------------------------------------------------------------------------
# Entry point
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(description="ZMQ Embedding Server for Frigate")
parser.add_argument(
"--model",
required=True,
help="Path to the ONNX model file (e.g. /config/model_cache/facedet/arcface.onnx)",
)
parser.add_argument(
"--model-type",
default="arcface",
choices=list(_RUNNERS.keys()),
help="Model type key (default: arcface)",
)
parser.add_argument(
"--port",
type=int,
default=5556,
help="TCP port to listen on (default: 5556)",
)
args = parser.parse_args()
if not os.path.exists(args.model):
logger.error(f"Model file not found: {args.model}")
sys.exit(1)
logger.info(f"Loading model: {args.model}")
session = build_ort_session(args.model, model_type=args.model_type)
warmup(session, model_type=args.model_type)
serve(session, port=args.port, model_type=args.model_type)
if __name__ == "__main__":
main()
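The REQ/REP cycle above can be demonstrated end-to-end without an ONNX model; this sketch uses a stand-in REP loop that returns a zero embedding in place of real inference, over `inproc://` so it runs in one process (a real deployment binds `tcp://` as the server does):

```python
import json
import threading

import numpy as np
import zmq

ENDPOINT = "inproc://embed-demo"  # in-process transport for the demo; the server binds tcp://
ctx = zmq.Context()
ready = threading.Event()


def zero_embedding_server() -> None:
    """Answer one request with a zero (batch, 512)-dim embedding, ArcFace-shaped."""
    sock = ctx.socket(zmq.REP)
    sock.bind(ENDPOINT)
    ready.set()
    frames = sock.recv_multipart()
    batch = json.loads(frames[0].decode("utf-8"))["shape"][0]
    zero = np.zeros((batch, 512), dtype=np.float32)
    header = json.dumps({"shape": list(zero.shape), "dtype": "float32"}).encode("utf-8")
    sock.send_multipart([header, zero.tobytes()])
    sock.close(linger=0)


t = threading.Thread(target=zero_embedding_server, daemon=True)
t.start()
ready.wait()

# Client side: the same framing ZmqEmbeddingRunner.run() uses
req = ctx.socket(zmq.REQ)
req.connect(ENDPOINT)
tensor = np.zeros((1, 3, 112, 112), dtype=np.float32)
header = {"shape": list(tensor.shape), "dtype": "float32", "model_type": "arcface"}
req.send_multipart([json.dumps(header).encode("utf-8"), tensor.tobytes()])
resp_header, resp_payload = req.recv_multipart()
meta = json.loads(resp_header.decode("utf-8"))
embedding = np.frombuffer(resp_payload, dtype=np.dtype(meta["dtype"])).reshape(meta["shape"])
req.close(linger=0)
t.join()
ctx.term()
```

Binding before `ready.set()` matters with `inproc://`, which (unlike `tcp://`) requires the bind to happen before any connect.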