Compare commits

...

3 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| GuoQing Liu | 4f322af577 | Merge 33048ebc01 into e79ff9a079 | 2025-11-26 09:11:48 +08:00 |
| Nicolas Mowen | e79ff9a079 | Add built in support for memray memory debugging (#21057) | 2025-11-25 16:34:01 -06:00 |
| Abinila Siva | fe47620153 | [MemryX] Clean shutdown of detector process (#21035) | 2025-11-25 10:25:07 -07:00 |

Commit message body for fe47620153:

* update code for clean exit
* ruff format
* remove unused time import
* update stop_event handling
* remove hasattr check
6 changed files with 297 additions and 20 deletions

View File

@@ -81,3 +81,5 @@ librosa==0.11.*
 soundfile==0.13.*
 # DeGirum detector
 degirum == 0.16.*
+# Memory profiling
+memray == 1.15.*

View File

@@ -0,0 +1,129 @@
---
id: memory
title: Memory Troubleshooting
---
Frigate includes built-in memory profiling using [memray](https://bloomberg.github.io/memray/) to help diagnose memory issues. This feature allows you to profile specific Frigate modules to identify memory leaks, excessive allocations, or other memory-related problems.
## Enabling Memory Profiling
Memory profiling is controlled via the `FRIGATE_MEMRAY_MODULES` environment variable. Set it to a comma-separated list of module names you want to profile:
```bash
export FRIGATE_MEMRAY_MODULES="frigate.review_segment_manager,frigate.capture"
```
### Module Names
Frigate processes are named using a module-based naming scheme. Common module names include:
- `frigate.review_segment_manager` - Review segment processing
- `frigate.recording_manager` - Recording management
- `frigate.capture` - Camera capture processes (all cameras with this module name)
- `frigate.process` - Camera processing/tracking (all cameras with this module name)
- `frigate.output` - Output processing
- `frigate.audio_manager` - Audio processing
- `frigate.embeddings` - Embeddings processing
You can also specify the full process name (including camera-specific identifiers) if you want to profile a specific camera:
```bash
export FRIGATE_MEMRAY_MODULES="frigate.capture:front_door"
```
When you specify a module name (e.g., `frigate.capture`), all processes with that module prefix will be profiled. For example, `frigate.capture` will profile all camera capture processes.
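In practice the selection is a simple prefix match on the process name. The sketch below illustrates the rule; the helper `memray_enabled_for` is purely illustrative and not part of Frigate:
```python
import os

# Example value: profile all capture processes plus the review segment manager
os.environ["FRIGATE_MEMRAY_MODULES"] = "frigate.review_segment_manager,frigate.capture"


def memray_enabled_for(process_name: str) -> bool:
    """Illustrative only: decide whether a given process should be profiled."""
    enabled = [m.strip() for m in os.environ["FRIGATE_MEMRAY_MODULES"].split(",")]
    module_name = process_name.split(":")[0]  # "frigate.capture:front_door" -> "frigate.capture"
    # Either the module prefix or the full process name may be listed
    return module_name in enabled or process_name in enabled


print(memray_enabled_for("frigate.capture:front_door"))  # True, the module prefix is listed
print(memray_enabled_for("frigate.audio_manager"))       # False, not listed
```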
## How It Works
1. **Binary File Creation**: When profiling is enabled, memray creates a binary file (`.bin`) in `/config/memray_reports/` that is updated continuously while the process runs.
2. **Automatic HTML Generation**: On normal process exit, Frigate automatically:
- Stops memray tracking
- Generates an HTML flamegraph report
- Saves it to `/config/memray_reports/<module_name>.html`
3. **Crash Recovery**: If a process crashes (SIGKILL, segfault, etc.), the binary file is preserved with all data up to the crash point, and you can manually generate the HTML report from it. A rough sketch of this whole lifecycle follows below.
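For readers who prefer code to prose, here is a minimal sketch of that lifecycle using memray's Python API and CLI. It mirrors what Frigate's process wrapper does; the capture file name is illustrative and error handling is omitted:
```python
import atexit
import pathlib
import subprocess

import memray

reports_dir = pathlib.Path("/config/memray_reports")
reports_dir.mkdir(parents=True, exist_ok=True)
binary_file = reports_dir / "frigate.capture_front_door.bin"  # illustrative name

# 1. Start tracking; the .bin capture file is written continuously while the process runs.
tracker = memray.Tracker(str(binary_file))
tracker.__enter__()


def _generate_report() -> None:
    # 2. On normal exit, stop tracking and render an HTML flamegraph next to the capture file.
    tracker.__exit__(None, None, None)
    subprocess.run(
        ["memray", "flamegraph", "--output", str(binary_file.with_suffix(".html")), str(binary_file)],
        check=False,
    )


# 3. If the process is killed before atexit can run, the .bin file survives and the
#    flamegraph can still be generated manually with the same CLI command.
atexit.register(_generate_report)
```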
## Viewing Reports
### Automatic Reports
After a process exits normally, you'll find HTML reports in `/config/memray_reports/`. Open these files in a web browser to view interactive flamegraphs showing memory usage patterns.
### Manual Report Generation
If a process crashes or you want to generate a report from an existing binary file, you can manually create the HTML report:
```bash
memray flamegraph /config/memray_reports/<module_name>.bin
```
This will generate an HTML file that you can open in your browser; pass `--output <path>.html` if you want to control where the report is written.
## Understanding the Reports
Memray flamegraphs show:
- **Memory allocations over time**: See where memory is being allocated in your code
- **Call stacks**: Understand the full call chain leading to allocations
- **Memory hotspots**: Identify functions or code paths that allocate the most memory
- **Memory leaks**: Spot patterns where memory is allocated but not freed
The interactive HTML reports allow you to:
- Zoom into specific time ranges
- Filter by function names
- View detailed allocation information
- Export data for further analysis
## Best Practices
1. **Profile During Issues**: Enable profiling when you're experiencing memory issues, not all the time, as it adds some overhead.
2. **Profile Specific Modules**: Instead of profiling everything, focus on the modules you suspect are causing issues.
3. **Let Processes Run**: Allow processes to run for a meaningful duration to capture representative memory usage patterns.
4. **Check Binary Files**: If HTML reports aren't generated automatically (e.g., after a crash), check for `.bin` files in `/config/memray_reports/` and generate reports manually.
5. **Compare Reports**: Generate reports at different times to compare memory usage patterns and identify trends.
## Troubleshooting
### No Reports Generated
- Check that the environment variable is set correctly
- Verify the module name matches exactly (case-sensitive)
- Check logs for memray-related errors
- Ensure the `/config/memray_reports/` directory exists and is writable (the short script below automates these checks)
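The following is a small diagnostic sketch, not part of Frigate, that runs these checks from a Python shell inside the container; the paths match the defaults described above:
```python
import importlib.util
import os
import pathlib

modules = os.environ.get("FRIGATE_MEMRAY_MODULES", "")
print("FRIGATE_MEMRAY_MODULES =", modules if modules else "<not set>")

reports_dir = pathlib.Path("/config/memray_reports")
print("reports dir exists:  ", reports_dir.is_dir())
print("reports dir writable:", reports_dir.is_dir() and os.access(reports_dir, os.W_OK))
print("memray importable:   ", importlib.util.find_spec("memray") is not None)

# List any capture files that have already been written
if reports_dir.is_dir():
    for capture in sorted(reports_dir.glob("*.bin")):
        print("capture file:", capture.name, capture.stat().st_size, "bytes")
```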
### Process Crashed Before Report Generation
- Look for `.bin` files in `/config/memray_reports/`
- Manually generate HTML reports using: `memray flamegraph <file>.bin`
- The binary file contains all data up to the crash point
### Reports Show No Data
- Ensure the process ran long enough to generate meaningful data
- Check that memray is properly installed (included by default in Frigate)
- Verify the process actually started and ran (check process logs)
## Example Usage
```bash
# Enable profiling for review and capture modules
export FRIGATE_MEMRAY_MODULES="frigate.review_segment_manager,frigate.capture"
# Start Frigate
# ... let it run for a while ...
# Check for reports
ls -lh /config/memray_reports/
# If a process crashed, manually generate report
memray flamegraph /config/memray_reports/frigate.capture_front_door.bin
```
For more information about memray and interpreting reports, see the [official memray documentation](https://bloomberg.github.io/memray/).

View File

@@ -131,6 +131,7 @@ const sidebars: SidebarsConfig = {
       "troubleshooting/recordings",
       "troubleshooting/gpu",
       "troubleshooting/edgetpu",
+      "troubleshooting/memory",
     ],
     Development: [
       "development/contributing",

View File

@@ -2,7 +2,6 @@ import glob
 import logging
 import os
 import shutil
-import time
 import urllib.request
 import zipfile
 from queue import Queue
@@ -55,6 +54,9 @@ class MemryXDetector(DetectionApi):
             )
             return
 
+        # Initialize stop_event as None, will be set later by set_stop_event()
+        self.stop_event = None
+
         model_cfg = getattr(detector_config, "model", None)
 
         # Check if model_type was explicitly set by the user
@@ -363,26 +365,43 @@
     def process_input(self):
         """Input callback function: wait for frames in the input queue, preprocess, and send to MX3 (return)"""
         while True:
+            # Check if shutdown is requested
+            if self.stop_event and self.stop_event.is_set():
+                logger.debug("[process_input] Stop event detected, returning None")
+                return None
+
             try:
-                # Wait for a frame from the queue (blocking call)
-                frame = self.capture_queue.get(
-                    block=True
-                )  # Blocks until data is available
+                # Wait for a frame from the queue with timeout to check stop_event periodically
+                frame = self.capture_queue.get(block=True, timeout=0.5)
                 return frame
             except Exception as e:
-                logger.info(f"[process_input] Error processing input: {e}")
-                time.sleep(0.1)  # Prevent busy waiting in case of error
+                # Silently handle queue.Empty timeouts (expected during normal operation)
+                # Log any other unexpected exceptions
+                if "Empty" not in str(type(e).__name__):
+                    logger.warning(f"[process_input] Unexpected error: {e}")
+                # Loop continues and will check stop_event at the top
 
     def receive_output(self):
         """Retrieve processed results from MemryX output queue + a copy of the original frame"""
-        connection_id = (
-            self.capture_id_queue.get()
-        )  # Get the corresponding connection ID
-        detections = self.output_queue.get()  # Get detections from MemryX
-
-        return connection_id, detections
+        try:
+            # Get connection ID with timeout
+            connection_id = self.capture_id_queue.get(
+                block=True, timeout=1.0
+            )  # Get the corresponding connection ID
+            detections = self.output_queue.get()  # Get detections from MemryX
+
+            return connection_id, detections
+        except Exception as e:
+            # On timeout or stop event, return None
+            if self.stop_event and self.stop_event.is_set():
+                logger.debug("[receive_output] Stop event detected, exiting")
+            # Silently handle queue.Empty timeouts, they're expected during normal operation
+            elif "Empty" not in str(type(e).__name__):
+                logger.warning(f"[receive_output] Error receiving output: {e}")
+            return None, None
 
     def post_process_yolonas(self, output):
         predictions = output[0]
@@ -831,6 +850,19 @@
                 f"{self.memx_model_type} is currently not supported for memryx. See the docs for more info on supported models."
             )
 
+    def set_stop_event(self, stop_event):
+        """Set the stop event for graceful shutdown."""
+        self.stop_event = stop_event
+
+    def shutdown(self):
+        """Gracefully shutdown the MemryX accelerator"""
+        try:
+            if hasattr(self, "accl") and self.accl is not None:
+                self.accl.shutdown()
+                logger.info("MemryX accelerator shutdown complete")
+        except Exception as e:
+            logger.error(f"Error during MemryX shutdown: {e}")
+
     def detect_raw(self, tensor_input: np.ndarray):
         """Removed synchronous detect_raw() function so that we only use async"""
         return 0

View File

@@ -43,6 +43,7 @@ class BaseLocalDetector(ObjectDetector):
         self,
         detector_config: BaseDetectorConfig = None,
         labels: str = None,
+        stop_event: MpEvent = None,
     ):
         self.fps = EventsPerSecond()
         if labels is None:
@@ -60,6 +61,10 @@
         self.detect_api = create_detector(detector_config)
 
+        # If the detector supports stop_event, pass it
+        if hasattr(self.detect_api, "set_stop_event") and stop_event:
+            self.detect_api.set_stop_event(stop_event)
+
     def _transform_input(self, tensor_input: np.ndarray) -> np.ndarray:
         if self.input_transform:
             tensor_input = np.transpose(tensor_input, self.input_transform)
@@ -240,6 +245,10 @@ class AsyncDetectorRunner(FrigateProcess):
         while not self.stop_event.is_set():
             connection_id, detections = self._detector.async_receive_output()
 
+            # Handle timeout case (queue.Empty) - just continue
+            if connection_id is None:
+                continue
+
             if not self.send_times:
                 # guard; shouldn't happen if send/recv are balanced
                 continue
@@ -266,21 +275,38 @@
         self._frame_manager = SharedMemoryFrameManager()
         self._publisher = ObjectDetectorPublisher()
-        self._detector = AsyncLocalObjectDetector(detector_config=self.detector_config)
+        self._detector = AsyncLocalObjectDetector(
+            detector_config=self.detector_config, stop_event=self.stop_event
+        )
 
         for name in self.cameras:
             self.create_output_shm(name)
 
-        t_detect = threading.Thread(target=self._detect_worker, daemon=True)
-        t_result = threading.Thread(target=self._result_worker, daemon=True)
+        t_detect = threading.Thread(target=self._detect_worker, daemon=False)
+        t_result = threading.Thread(target=self._result_worker, daemon=False)
         t_detect.start()
         t_result.start()
 
-        while not self.stop_event.is_set():
-            time.sleep(0.5)
-
-        self._publisher.stop()
-        logger.info("Exited async detection process...")
+        try:
+            while not self.stop_event.is_set():
+                time.sleep(0.5)
+
+            logger.info(
+                "Stop event detected, waiting for detector threads to finish..."
+            )
+
+            # Wait for threads to finish processing
+            t_detect.join(timeout=5)
+            t_result.join(timeout=5)
+
+            # Shutdown the AsyncDetector
+            self._detector.detect_api.shutdown()
+
+            self._publisher.stop()
+        except Exception as e:
+            logger.error(f"Error during async detector shutdown: {e}")
+        finally:
+            logger.info("Exited Async detection process...")
 
 
 class ObjectDetectProcess:
@@ -308,7 +334,7 @@
         # if the process has already exited on its own, just return
         if self.detect_process and self.detect_process.exitcode:
            return
 
-        self.detect_process.terminate()
+        logging.info("Waiting for detection process to exit gracefully...")
         self.detect_process.join(timeout=30)
         if self.detect_process.exitcode is None:

View File

@ -1,7 +1,10 @@
import atexit
import faulthandler
import logging
import multiprocessing as mp
import os
import pathlib
import subprocess
import threading
from logging.handlers import QueueHandler
from multiprocessing.synchronize import Event as MpEvent
@@ -48,6 +51,7 @@ class FrigateProcess(BaseProcess):
     def before_start(self) -> None:
         self.__log_queue = frigate.log.log_listener.queue
+        self.__memray_tracker = None
 
     def pre_run_setup(self, logConfig: LoggerConfig | None = None) -> None:
         os.nice(self.priority)
@@ -64,3 +68,86 @@
             frigate.log.apply_log_levels(
                 logConfig.default.value.upper(), logConfig.logs
             )
+
+        self._setup_memray()
+
+    def _setup_memray(self) -> None:
+        """Setup memray profiling if enabled via environment variable."""
+        memray_modules = os.environ.get("FRIGATE_MEMRAY_MODULES", "")
+
+        if not memray_modules:
+            return
+
+        # Extract module name from process name (e.g., "frigate.capture:camera" -> "frigate.capture")
+        process_name = self.name
+        module_name = (
+            process_name.split(":")[0] if ":" in process_name else process_name
+        )
+
+        enabled_modules = [m.strip() for m in memray_modules.split(",")]
+
+        if module_name not in enabled_modules and process_name not in enabled_modules:
+            return
+
+        try:
+            import memray
+
+            reports_dir = pathlib.Path("/config/memray_reports")
+            reports_dir.mkdir(parents=True, exist_ok=True)
+
+            safe_name = (
+                process_name.replace(":", "_").replace("/", "_").replace("\\", "_")
+            )
+            binary_file = reports_dir / f"{safe_name}.bin"
+
+            self.__memray_tracker = memray.Tracker(str(binary_file))
+            self.__memray_tracker.__enter__()
+
+            # Register cleanup handler to stop tracking and generate HTML report
+            # atexit runs on normal exits and most signal-based terminations (SIGTERM, SIGINT)
+            # For hard kills (SIGKILL) or segfaults, the binary file is preserved for manual generation
+            atexit.register(self._cleanup_memray, safe_name, binary_file)
+
+            self.logger.info(
+                f"Memray profiling enabled for module {module_name} (process: {self.name}). "
+                f"Binary file (updated continuously): {binary_file}. "
+                f"HTML report will be generated on exit: {reports_dir}/{safe_name}.html. "
+                f"If process crashes, manually generate with: memray flamegraph {binary_file}"
+            )
+        except Exception as e:
+            self.logger.error(f"Failed to setup memray profiling: {e}", exc_info=True)
+
+    def _cleanup_memray(self, safe_name: str, binary_file: pathlib.Path) -> None:
+        """Stop memray tracking and generate HTML report."""
+        if self.__memray_tracker is None:
+            return
+
+        try:
+            self.__memray_tracker.__exit__(None, None, None)
+            self.__memray_tracker = None
+
+            reports_dir = pathlib.Path("/config/memray_reports")
+            html_file = reports_dir / f"{safe_name}.html"
+
+            result = subprocess.run(
+                ["memray", "flamegraph", "--output", str(html_file), str(binary_file)],
+                capture_output=True,
+                text=True,
+                timeout=10,
+            )
+
+            if result.returncode == 0:
+                self.logger.info(f"Memray report generated: {html_file}")
+            else:
+                self.logger.error(
+                    f"Failed to generate memray report: {result.stderr}. "
+                    f"Binary file preserved at {binary_file} for manual generation."
+                )
+
+            # Keep the binary file for manual report generation if needed
+            # Users can run: memray flamegraph {binary_file}
+        except subprocess.TimeoutExpired:
+            self.logger.error("Memray report generation timed out")
+        except Exception as e:
+            self.logger.error(f"Failed to cleanup memray profiling: {e}", exc_info=True)