diff --git a/docs/docs/configuration/hardware_acceleration_enrichments.md b/docs/docs/configuration/hardware_acceleration_enrichments.md index 84688b8b4..45c7cd4d1 100644 --- a/docs/docs/configuration/hardware_acceleration_enrichments.md +++ b/docs/docs/configuration/hardware_acceleration_enrichments.md @@ -5,7 +5,7 @@ title: Enrichments # Enrichments -Some of Frigate's enrichments can use a discrete GPU / NPU for accelerated processing. +Some of Frigate's enrichments can use a discrete GPU or integrated GPU for accelerated processing. ## Requirements @@ -18,8 +18,10 @@ Object detection and enrichments (like Semantic Search, Face Recognition, and Li - **Intel** - OpenVINO will automatically be detected and used for enrichments in the default Frigate image. + - **Note:** Intel NPUs have limited model support for enrichments. GPU is recommended for enrichments when available. - **Nvidia** + - Nvidia GPUs will automatically be detected and used for enrichments in the `-tensorrt` Frigate image. - Jetson devices will automatically be detected and used for enrichments in the `-tensorrt-jp6` Frigate image. diff --git a/docs/docs/configuration/object_detectors.md b/docs/docs/configuration/object_detectors.md index 2da602979..2dd3330c2 100644 --- a/docs/docs/configuration/object_detectors.md +++ b/docs/docs/configuration/object_detectors.md @@ -261,6 +261,8 @@ OpenVINO is supported on 6th Gen Intel platforms (Skylake) and newer. It will al :::tip +**NPU + GPU Systems:** If you have both NPU and GPU available (Intel Core Ultra processors), use NPU for object detection and GPU for enrichments (semantic search, face recognition, etc.) for best performance and compatibility. + When using many cameras one detector may not be enough to keep up. Multiple detectors can be defined assuming GPU resources are available. An example configuration would be: ```yaml @@ -283,7 +285,7 @@ detectors: | [RF-DETR](#rf-detr) | ✅ | ✅ | Requires XE iGPU or Arc | | [YOLO-NAS](#yolo-nas) | ✅ | ✅ | | | [MobileNet v2](#ssdlite-mobilenet-v2) | ✅ | ✅ | Fast and lightweight model, less accurate than larger models | -| [YOLOX](#yolox) | ✅ | ? | | +| [YOLOX](#yolox) | ✅ | ? | | | [D-FINE](#d-fine) | ❌ | ❌ | | #### SSDLite MobileNet v2 diff --git a/docs/docs/configuration/semantic_search.md b/docs/docs/configuration/semantic_search.md index 70ef161be..b3ce37177 100644 --- a/docs/docs/configuration/semantic_search.md +++ b/docs/docs/configuration/semantic_search.md @@ -78,7 +78,7 @@ Switching between V1 and V2 requires reindexing your embeddings. The embeddings ### GPU Acceleration -The CLIP models are downloaded in ONNX format, and the `large` model can be accelerated using GPU / NPU hardware, when available. This depends on the Docker build that is used. You can also target a specific device in a multi-GPU installation. +The CLIP models are downloaded in ONNX format, and the `large` model can be accelerated using GPU hardware, when available. This depends on the Docker build that is used. You can also target a specific device in a multi-GPU installation. ```yaml semantic_search: @@ -90,7 +90,7 @@ semantic_search: :::info -If the correct build is used for your GPU / NPU and the `large` model is configured, then the GPU / NPU will be detected and used automatically. +If the correct build is used for your GPU / NPU and the `large` model is configured, then the GPU will be detected and used automatically. 
Specify the `device` option to target a specific GPU in a multi-GPU system (see [onnxruntime's provider options](https://onnxruntime.ai/docs/execution-providers/)). If you do not specify a device, the first available GPU will be used. diff --git a/frigate/data_processing/real_time/custom_classification.py b/frigate/data_processing/real_time/custom_classification.py index e4048b6ec..7d0975712 100644 --- a/frigate/data_processing/real_time/custom_classification.py +++ b/frigate/data_processing/real_time/custom_classification.py @@ -418,8 +418,8 @@ class CustomObjectClassificationProcessor(RealTimeProcessorApi): obj_data["box"][2], obj_data["box"][3], max( - obj_data["box"][1] - obj_data["box"][0], - obj_data["box"][3] - obj_data["box"][2], + obj_data["box"][2] - obj_data["box"][0], + obj_data["box"][3] - obj_data["box"][1], ), 1.0, ) @@ -546,5 +546,8 @@ def write_classification_attempt( ) # delete oldest face image if maximum is reached - if len(files) > max_files: - os.unlink(os.path.join(folder, files[-1])) + try: + if len(files) > max_files: + os.unlink(os.path.join(folder, files[-1])) + except FileNotFoundError: + pass diff --git a/frigate/detectors/detection_runners.py b/frigate/detectors/detection_runners.py index 5b45238be..5cadb3d52 100644 --- a/frigate/detectors/detection_runners.py +++ b/frigate/detectors/detection_runners.py @@ -3,6 +3,7 @@ import logging import os import platform +import threading from abc import ABC, abstractmethod from typing import Any @@ -161,12 +162,12 @@ class CudaGraphRunner(BaseModelRunner): """ @staticmethod - def is_complex_model(model_type: str) -> bool: + def is_model_supported(model_type: str) -> bool: # Import here to avoid circular imports from frigate.detectors.detector_config import ModelTypeEnum from frigate.embeddings.types import EnrichmentModelTypeEnum - return model_type in [ + return model_type not in [ ModelTypeEnum.yolonas.value, EnrichmentModelTypeEnum.paddleocr.value, EnrichmentModelTypeEnum.jina_v1.value, @@ -239,9 +240,30 @@ class OpenVINOModelRunner(BaseModelRunner): EnrichmentModelTypeEnum.jina_v2.value, ] + @staticmethod + def is_model_npu_supported(model_type: str) -> bool: + # Import here to avoid circular imports + from frigate.embeddings.types import EnrichmentModelTypeEnum + + return model_type not in [ + EnrichmentModelTypeEnum.paddleocr.value, + EnrichmentModelTypeEnum.jina_v1.value, + EnrichmentModelTypeEnum.jina_v2.value, + EnrichmentModelTypeEnum.arcface.value, + ] + def __init__(self, model_path: str, device: str, model_type: str, **kwargs): self.model_path = model_path self.device = device + + if device == "NPU" and not OpenVINOModelRunner.is_model_npu_supported( + model_type + ): + logger.warning( + f"OpenVINO model {model_type} is not supported on NPU, using GPU instead" + ) + device = "GPU" + self.complex_model = OpenVINOModelRunner.is_complex_model(model_type) if not os.path.isfile(model_path): @@ -269,6 +291,10 @@ class OpenVINOModelRunner(BaseModelRunner): self.infer_request = self.compiled_model.create_infer_request() self.input_tensor: ov.Tensor | None = None + # Thread lock to prevent concurrent inference (needed for JinaV2 which shares + # one runner between text and vision embeddings called from different threads) + self._inference_lock = threading.Lock() + if not self.complex_model: try: input_shape = self.compiled_model.inputs[0].get_shape() @@ -312,67 +338,70 @@ class OpenVINOModelRunner(BaseModelRunner): Returns: List of output tensors """ - # Handle single input case for backward compatibility - if ( - 
len(inputs) == 1 - and len(self.compiled_model.inputs) == 1 - and self.input_tensor is not None - ): - # Single input case - use the pre-allocated tensor for efficiency - input_data = list(inputs.values())[0] - np.copyto(self.input_tensor.data, input_data) - self.infer_request.infer(self.input_tensor) - else: - if self.complex_model: - try: - # This ensures the model starts with a clean state for each sequence - # Important for RNN models like PaddleOCR recognition - self.infer_request.reset_state() - except Exception: - # this will raise an exception for models with AUTO set as the device - pass + # Lock prevents concurrent access to infer_request + # Needed for JinaV2: genai thread (text) + embeddings thread (vision) + with self._inference_lock: + # Handle single input case for backward compatibility + if ( + len(inputs) == 1 + and len(self.compiled_model.inputs) == 1 + and self.input_tensor is not None + ): + # Single input case - use the pre-allocated tensor for efficiency + input_data = list(inputs.values())[0] + np.copyto(self.input_tensor.data, input_data) + self.infer_request.infer(self.input_tensor) + else: + if self.complex_model: + try: + # This ensures the model starts with a clean state for each sequence + # Important for RNN models like PaddleOCR recognition + self.infer_request.reset_state() + except Exception: + # this will raise an exception for models with AUTO set as the device + pass - # Multiple inputs case - set each input by name - for input_name, input_data in inputs.items(): - # Find the input by name and its index - input_port = None - input_index = None - for idx, port in enumerate(self.compiled_model.inputs): - if port.get_any_name() == input_name: - input_port = port - input_index = idx - break + # Multiple inputs case - set each input by name + for input_name, input_data in inputs.items(): + # Find the input by name and its index + input_port = None + input_index = None + for idx, port in enumerate(self.compiled_model.inputs): + if port.get_any_name() == input_name: + input_port = port + input_index = idx + break - if input_port is None: - raise ValueError(f"Input '{input_name}' not found in model") + if input_port is None: + raise ValueError(f"Input '{input_name}' not found in model") - # Create tensor with the correct element type - input_element_type = input_port.get_element_type() + # Create tensor with the correct element type + input_element_type = input_port.get_element_type() - # Ensure input data matches the expected dtype to prevent type mismatches - # that can occur with models like Jina-CLIP v2 running on OpenVINO - expected_dtype = input_element_type.to_dtype() - if input_data.dtype != expected_dtype: - logger.debug( - f"Converting input '{input_name}' from {input_data.dtype} to {expected_dtype}" - ) - input_data = input_data.astype(expected_dtype) + # Ensure input data matches the expected dtype to prevent type mismatches + # that can occur with models like Jina-CLIP v2 running on OpenVINO + expected_dtype = input_element_type.to_dtype() + if input_data.dtype != expected_dtype: + logger.debug( + f"Converting input '{input_name}' from {input_data.dtype} to {expected_dtype}" + ) + input_data = input_data.astype(expected_dtype) - input_tensor = ov.Tensor(input_element_type, input_data.shape) - np.copyto(input_tensor.data, input_data) + input_tensor = ov.Tensor(input_element_type, input_data.shape) + np.copyto(input_tensor.data, input_data) - # Set the input tensor for the specific port index - self.infer_request.set_input_tensor(input_index, 
input_tensor)
+                    # Set the input tensor for the specific port index
+                    self.infer_request.set_input_tensor(input_index, input_tensor)
 
-            # Run inference
-            self.infer_request.infer()
+                # Run inference
+                self.infer_request.infer()
 
-        # Get all output tensors
-        outputs = []
-        for i in range(len(self.compiled_model.outputs)):
-            outputs.append(self.infer_request.get_output_tensor(i).data)
+            # Get all output tensors
+            outputs = []
+            for i in range(len(self.compiled_model.outputs)):
+                outputs.append(self.infer_request.get_output_tensor(i).data)
 
-        return outputs
+            return outputs
 
 
 class RKNNModelRunner(BaseModelRunner):
@@ -500,7 +529,7 @@ def get_optimized_runner(
         return OpenVINOModelRunner(model_path, device, model_type, **kwargs)
 
     if (
-        not CudaGraphRunner.is_complex_model(model_type)
+        CudaGraphRunner.is_model_supported(model_type)
         and providers[0] == "CUDAExecutionProvider"
     ):
         options[0] = {
diff --git a/frigate/genai/__init__.py b/frigate/genai/__init__.py
index 13d57046c..1c9c2b943 100644
--- a/frigate/genai/__init__.py
+++ b/frigate/genai/__init__.py
@@ -113,8 +113,8 @@ When forming your description:
 ## Response Format
 
 Your response MUST be a flat JSON object with:
-- `title` (string): A concise, direct title that describes the purpose or overall action, not just what you literally see. {"Use spatial context when available to make titles more meaningful." if camera_context_section else ""} Use names from "Objects in Scene" based on what you visually observe. If you see both a name and an unidentified object of the same type but visually observe only one person/object, use ONLY the name. Examples: "Joe walking dog", "Person taking out trash", "Joe accessing vehicle", "Person leaving porch for driveway", "Joe and person on front porch".
-- `scene` (string): A narrative description of what happens across the sequence from start to finish. **Only describe actions you can actually observe happening in the frames provided.** Do not infer or assume actions that aren't visible (e.g., if you see someone walking but never see them sit, don't say they sat down). Include setting, detected objects, and their observable actions. Avoid speculation or filling in assumed behaviors. Your description should align with and support the threat level you assign.
+- `title` (string): A concise, direct title that describes the primary action or event in the sequence, not just what you literally see. {"Use spatial context when available to make titles more meaningful." if camera_context_section else ""} When multiple objects/actions are present, prioritize whichever is most prominent or occurs first. Use names from "Objects in Scene" based on what you visually observe. If you see both a name and an unidentified object of the same type but visually observe only one person/object, use ONLY the name. Examples: "Joe walking dog", "Person taking out trash", "Vehicle arriving in driveway", "Joe accessing vehicle", "Person leaving porch for driveway".
+- `scene` (string): A narrative description of what happens across the sequence from start to finish, in chronological order. Start by describing how the sequence begins, then describe the progression of events. **Describe all significant movements and actions in the order they occur.** For example, if a vehicle arrives and then a person exits, describe both actions sequentially. 
**Only describe actions you can actually observe happening in the frames provided.** Do not infer or assume actions that aren't visible (e.g., if you see someone walking but never see them sit, don't say they sat down). Include setting, detected objects, and their observable actions. Avoid speculation or filling in assumed behaviors. Your description should align with and support the threat level you assign. - `confidence` (float): 0-1 confidence in your analysis. Higher confidence when objects/actions are clearly visible and context is unambiguous. Lower confidence when the sequence is unclear, objects are partially obscured, or context is ambiguous. - `potential_threat_level` (integer): 0, 1, or 2 as defined in "Normal Activity Patterns for This Property" above. Your threat level must be consistent with your scene description and the guidance above. {get_concern_prompt()} diff --git a/web/src/components/card/ClassificationCard.tsx b/web/src/components/card/ClassificationCard.tsx index cd7d89827..4511c22d5 100644 --- a/web/src/components/card/ClassificationCard.tsx +++ b/web/src/components/card/ClassificationCard.tsx @@ -148,13 +148,13 @@ export const ClassificationCard = forwardRef<
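
Reviewer note: the tip added to `object_detectors.md` above recommends running object detection on the NPU while leaving enrichments to the GPU, but no combined example appears in the patch. The sketch below is illustrative only and not part of the diff. It assumes an Intel Core Ultra system where OpenVINO exposes both devices, and it reuses option names from the existing Frigate docs (`type: openvino`, `device`, `semantic_search`, `face_recognition`); verify the exact keys against the current config reference, and note that the `model` and camera sections are omitted.

```yaml
# Minimal sketch of the NPU-for-detection, GPU-for-enrichments split
# described in the object_detectors.md tip (assumed keys, not part of this patch).
detectors:
  ov:
    type: openvino
    device: NPU        # object detection runs on the NPU

semantic_search:
  enabled: true
  model_size: large    # the large CLIP model is the one that benefits from GPU acceleration
  # no explicit device: the GPU is auto-detected for enrichments, per semantic_search.md

face_recognition:
  enabled: true
```

With this layout the new `is_model_npu_supported()` fallback in `detection_runners.py` never fires, since no enrichment runner is asked to use the NPU; it only matters when a user explicitly targets the NPU for an enrichment model.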