Genai docs refactor & fixes (#22175)

* Improve GenAI docs
* Clarify
* Fix config updating
* Implement streaming for other providers
* Set openai base url if applied
* Cast context size
2026-07-08 21:11:25 +03:00 · 2026-02-28 10:40:26 -07:00 · 2026-02-28 10:40:26 -07:00 · 4232cc483d
commit 4232cc483d
parent 6a21b2952d
7 changed files with 624 additions and 108 deletions
--- a/docs/docs/configuration/genai/config.md
+++ b/docs/docs/configuration/genai/config.md
@ -5,39 +5,31 @@ title: Configuring Generative AI

 ## Configuration

-A Generative AI provider can be configured in the global config, which will make the Generative AI features available for use. There are currently 4 native providers available to integrate with Frigate. Other providers that support the OpenAI standard API can also be used. See the OpenAI section below.
+A Generative AI provider can be configured in the global config, which will make the Generative AI features available for use. There are currently 4 native providers available to integrate with Frigate. Other providers that support the OpenAI standard API can also be used. See the OpenAI-Compatible section below.

 To use Generative AI, you must define a single provider at the global level of your Frigate configuration. If the provider you choose requires an API key, you may either directly paste it in your configuration, or store it in an environment variable prefixed with `FRIGATE_`.

-## Ollama
+## Local Providers
+
+Local providers run on your own hardware and keep all data processing private. These require a GPU or dedicated hardware for best performance.

 :::warning

-Using Ollama on CPU is not recommended, high inference times make using Generative AI impractical.
+Running Generative AI models on CPU is not recommended, as high inference times make using Generative AI impractical.

 :::

-[Ollama](https://ollama.com/) allows you to self-host large language models and keep everything running locally. It is highly recommended to host this server on a machine with an Nvidia graphics card, or on a Apple silicon Mac for best performance.
+### Recommended Local Models

-Most of the 7b parameter 4-bit vision models will fit inside 8GB of VRAM. There is also a [Docker container](https://hub.docker.com/r/ollama/ollama) available.
+You must use a vision-capable model with Frigate. The following models are recommended for local deployment:

-Parallel requests also come with some caveats. You will need to set `OLLAMA_NUM_PARALLEL=1` and choose a `OLLAMA_MAX_QUEUE` and `OLLAMA_MAX_LOADED_MODELS` values that are appropriate for your hardware and preferences. See the [Ollama documentation](https://docs.ollama.com/faq#how-does-ollama-handle-concurrent-requests).
-
-### Model Types: Instruct vs Thinking
-
-Most vision-language models are available as **instruct** models, which are fine-tuned to follow instructions and respond concisely to prompts. However, some models (such as certain Qwen-VL or minigpt variants) offer both **instruct** and **thinking** versions.
-
- **Instruct models** are always recommended for use with Frigate. These models generate direct, relevant, actionable descriptions that best fit Frigate's object and event summary use case.
- **Thinking models** are fine-tuned for more free-form, open-ended, and speculative outputs, which are typically not concise and may not provide the practical summaries Frigate expects. For this reason, Frigate does **not** recommend or support using thinking models.
-
-Some models are labeled as **hybrid** (capable of both thinking and instruct tasks). In these cases, Frigate will always use instruct-style prompts and specifically disables thinking-mode behaviors to ensure concise, useful responses.
-
-**Recommendation:**
-Always select the `-instruct` or documented instruct/tagged variant of any model you use in your Frigate configuration. If in doubt, refer to your model provider’s documentation or model library for guidance on the correct model variant to use.
-
-### Supported Models
-
-You must use a vision capable model with Frigate. Current model variants can be found [in their model library](https://ollama.com/library). Note that Frigate will not automatically download the model you specify in your config, Ollama will try to download the model but it may take longer than the timeout, it is recommended to pull the model beforehand by running `ollama pull your_model` on your Ollama server/Docker container. Note that the model specified in Frigate's config must match the downloaded model tag.
+| Model         | Notes                                                                                                                                                                |
+| ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `qwen3-vl`    | Strong visual and situational understanding, strong ability to identify smaller objects and interactions with objects.                                               |
+| `qwen3.5`     | Strong situational understanding, but lacks qwen3-vl's DeepStack, leading to worse performance at identifying objects in people's hands and other small details.     |
+| `Intern3.5VL` | Relatively fast with good vision comprehension                                                                                                                       |
+| `gemma3`      | Slower model with good vision and temporal understanding                                                                                                             |
+| `qwen2.5-vl`  | Fast but capable model with good vision comprehension                                                                                                                |

 :::info

@ -45,32 +37,64 @@ Each model is available in multiple parameter sizes (3b, 4b, 8b, etc.). Larger s

 :::

+:::note
+
+You should have at least 8 GB of RAM available (or VRAM if running on GPU) to run the 7B models, 16 GB to run the 13B models, and 24 GB to run the 33B models.
+
+:::
+
+### Model Types: Instruct vs Thinking
+
+Most vision-language models are available as **instruct** models, which are fine-tuned to follow instructions and respond concisely to prompts. However, some models (such as certain Qwen-VL or minigpt variants) offer both **instruct** and **thinking** versions.
+
+- **Instruct models** are always recommended for use with Frigate. These models generate direct, relevant, actionable descriptions that best fit Frigate's object and event summary use case.
+- **Reasoning / Thinking models** are fine-tuned for more free-form, open-ended, and speculative outputs, which are typically not concise and may not provide the practical summaries Frigate expects. For this reason, Frigate does **not** recommend or support using thinking models.
+
+Some models are labeled as **hybrid** (capable of both thinking and instruct tasks). In these cases, it is recommended to disable reasoning / thinking; how to do so is generally model-specific (see your model's documentation).
+
+**Recommendation:**
+Always select the `-instruct` or documented instruct/tagged variant of any model you use in your Frigate configuration. If in doubt, refer to your model provider's documentation or model library for guidance on the correct model variant to use.
+
+### llama.cpp
+
+[llama.cpp](https://github.com/ggml-org/llama.cpp) is a C++ implementation of LLaMA that provides a high-performance inference server.
+
+It is highly recommended to host the llama.cpp server on a machine with a discrete graphics card, or on an Apple silicon Mac for best performance.
+
+#### Supported Models
+
+You must use a vision-capable model with Frigate. The llama.cpp server supports various vision models in GGUF format.
+
+#### Configuration
+
+All llama.cpp native options can be passed through `provider_options`, including `temperature`, `top_k`, `top_p`, `min_p`, `repeat_penalty`, `repeat_last_n`, `seed`, `grammar`, and more. See the [llama.cpp server documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md) for a complete list of available parameters.
+
+```yaml
+genai:
+  provider: llamacpp
+  base_url: http://localhost:8080
+  model: your-model-name
+  provider_options:
+    context_size: 16000 # Tell Frigate your context size so it can send the appropriate amount of information.
+```
+
+### Ollama
+
+[Ollama](https://ollama.com/) allows you to self-host large language models and keep everything running locally. It is highly recommended to host this server on a machine with an Nvidia graphics card, or on an Apple silicon Mac for best performance.
+
+Most of the 7b parameter 4-bit vision models will fit inside 8GB of VRAM. There is also a [Docker container](https://hub.docker.com/r/ollama/ollama) available.
+
+Parallel requests also come with some caveats. You will need to set `OLLAMA_NUM_PARALLEL=1` and choose `OLLAMA_MAX_QUEUE` and `OLLAMA_MAX_LOADED_MODELS` values that are appropriate for your hardware and preferences. See the [Ollama documentation](https://docs.ollama.com/faq#how-does-ollama-handle-concurrent-requests).
+
 :::tip

 If you are trying to use a single model for Frigate and HomeAssistant, it will need to support vision and tools calling. qwen3-VL supports vision and tools simultaneously in Ollama.

 :::

-The following models are recommended:
+Note that Frigate will not automatically download the model you specify in your config. Ollama will try to download the model but it may take longer than the timeout, so it is recommended to pull the model beforehand by running `ollama pull your_model` on your Ollama server/Docker container. The model specified in Frigate's config must match the downloaded model tag.

-| Model         | Notes                                                                |
-| ------------- | -------------------------------------------------------------------- |
-| `qwen3-vl`    | Strong visual and situational understanding, higher vram requirement |
-| `Intern3.5VL` | Relatively fast with good vision comprehension                       |
-| `gemma3`      | Strong frame-to-frame understanding, slower inference times          |
-| `qwen2.5-vl`  | Fast but capable model with good vision comprehension                |
-
-:::note
-
-You should have at least 8 GB of RAM available (or VRAM if running on GPU) to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.
-
-:::
-
-#### Ollama Cloud models
-
-Ollama also supports [cloud models](https://ollama.com/cloud), where your local Ollama instance handles requests from Frigate, but model inference is performed in the cloud. Set up Ollama locally, sign in with your Ollama account, and specify the cloud model name in your Frigate config. For more details, see the Ollama cloud model [docs](https://docs.ollama.com/cloud).
-
-### Configuration
+#### Configuration

 ```yaml
 genai:
@ -83,49 +107,65 @@ genai:
      num_ctx: 8192 # make sure the context matches other services that are using ollama
 ```

-## llama.cpp
+### OpenAI-Compatible

-[llama.cpp](https://github.com/ggml-org/llama.cpp) is a C++ implementation of LLaMA that provides a high-performance inference server. Using llama.cpp directly gives you access to all native llama.cpp options and parameters.
+Frigate supports any provider that implements the OpenAI API standard. This includes self-hosted solutions like [vLLM](https://docs.vllm.ai/), [LocalAI](https://localai.io/), and other OpenAI-compatible servers.

-:::warning
+:::tip

-Using llama.cpp on CPU is not recommended, high inference times make using Generative AI impractical.
-
-:::
-
-It is highly recommended to host the llama.cpp server on a machine with a discrete graphics card, or on an Apple silicon Mac for best performance.
-
-### Supported Models
-
-You must use a vision capable model with Frigate. The llama.cpp server supports various vision models in GGUF format.
-
-### Configuration
+For OpenAI-compatible servers (such as llama.cpp) that don't expose the configured context size in the API response, you can manually specify the context size in `provider_options`:

 ```yaml
 genai:
-  provider: llamacpp
-  base_url: http://localhost:8080
+  provider: openai
+  base_url: http://your-llama-server
  model: your-model-name
  provider_options:
-    temperature: 0.7
-    repeat_penalty: 1.05
-    top_p: 0.8
-    top_k: 40
-    min_p: 0.05
-    seed: -1
+    context_size: 8192 # Specify the configured context size
 ```

-All llama.cpp native options can be passed through `provider_options`, including `temperature`, `top_k`, `top_p`, `min_p`, `repeat_penalty`, `repeat_last_n`, `seed`, `grammar`, and more. See the [llama.cpp server documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md) for a complete list of available parameters.
+This ensures Frigate uses the correct context window size when generating prompts.

-## Google Gemini
+:::
+
+#### Configuration
+
+```yaml
+genai:
+  provider: openai
+  base_url: http://your-server:port
+  api_key: your-api-key # May not be required for local servers
+  model: your-model-name
+```
+
+To use a different OpenAI-compatible API endpoint, set the `OPENAI_BASE_URL` environment variable to your provider's API URL.
+
+## Cloud Providers
+
+Cloud providers run on remote infrastructure and require an API key for authentication. These services handle all model inference on their servers.
+
+### Ollama Cloud
+
+Ollama also supports [cloud models](https://ollama.com/cloud), where your local Ollama instance handles requests from Frigate, but model inference is performed in the cloud. Set up Ollama locally, sign in with your Ollama account, and specify the cloud model name in your Frigate config. For more details, see the Ollama cloud model [docs](https://docs.ollama.com/cloud).
+
+#### Configuration
+
+```yaml
+genai:
+  provider: ollama
+  base_url: http://localhost:11434
+  model: cloud-model-name
+```
+
+### Google Gemini

 Google Gemini has a [free tier](https://ai.google.dev/pricing) for the API, however the limits may not be sufficient for standard Frigate usage. Choose a plan appropriate for your installation.

-### Supported Models
+#### Supported Models

 You must use a vision capable model with Frigate. Current model variants can be found [in their documentation](https://ai.google.dev/gemini-api/docs/models/gemini).

-### Get API Key
+#### Get API Key

 To start using Gemini, you must first get an API key from [Google AI Studio](https://aistudio.google.com).

@ -134,7 +174,7 @@ To start using Gemini, you must first get an API key from [Google AI Studio](htt
 3. Click "Create API key in new project"
 4. Copy the API key for use in your config

-### Configuration
+#### Configuration

 ```yaml
 genai:
@ -159,19 +199,19 @@ Other HTTP options are available, see the [python-genai documentation](https://g

 :::

-## OpenAI
+### OpenAI

 OpenAI does not have a free tier for their API. With the release of gpt-4o, pricing has been reduced and each generation should cost fractions of a cent if you choose to go this route.

-### Supported Models
+#### Supported Models

 You must use a vision capable model with Frigate. Current model variants can be found [in their documentation](https://platform.openai.com/docs/models).

-### Get API Key
+#### Get API Key

 To start using OpenAI, you must first [create an API key](https://platform.openai.com/api-keys) and [configure billing](https://platform.openai.com/settings/organization/billing/overview).

-### Configuration
+#### Configuration

 ```yaml
 genai:
@ -180,42 +220,19 @@ genai:
  model: gpt-4o
 ```

-:::note
-
-To use a different OpenAI-compatible API endpoint, set the `OPENAI_BASE_URL` environment variable to your provider's API URL.
-
-:::
-
-:::tip
-
-For OpenAI-compatible servers (such as llama.cpp) that don't expose the configured context size in the API response, you can manually specify the context size in `provider_options`:
-
-```yaml
-genai:
-  provider: openai
-  base_url: http://your-llama-server
-  model: your-model-name
-  provider_options:
-    context_size: 8192 # Specify the configured context size
-```
-
-This ensures Frigate uses the correct context window size when generating prompts.
-
-:::
-
-## Azure OpenAI
+### Azure OpenAI

 Microsoft offers several vision models through Azure OpenAI. A subscription is required.

-### Supported Models
+#### Supported Models

 You must use a vision capable model with Frigate. Current model variants can be found [in their documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models).

-### Create Resource and Get API Key
+#### Create Resource and Get API Key

 To start using Azure OpenAI, you must first [create a resource](https://learn.microsoft.com/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#create-a-resource). You'll need your API key, model name, and resource URL, which must include the `api-version` parameter (see the example below).

-### Configuration
+#### Configuration

 ```yaml
 genai:
--- a/frigate/genai/azure-openai.py
+++ b/frigate/genai/azure-openai.py
@ -167,3 +167,123 @@ class OpenAIClient(GenAIClient):
                "tool_calls": None,
                "finish_reason": "error",
            }
+
+    async def chat_with_tools_stream(
+        self,
+        messages: list[dict[str, Any]],
+        tools: Optional[list[dict[str, Any]]] = None,
+        tool_choice: Optional[str] = "auto",
+    ):
+        """
+        Stream chat with tools; yields content deltas then final message.
+
+        Implements streaming function calling/tool usage for Azure OpenAI models.
+        """
+        try:
+            openai_tool_choice = None
+            if tool_choice:
+                if tool_choice == "none":
+                    openai_tool_choice = "none"
+                elif tool_choice == "auto":
+                    openai_tool_choice = "auto"
+                elif tool_choice == "required":
+                    openai_tool_choice = "required"
+
+            request_params = {
+                "model": self.genai_config.model,
+                "messages": messages,
+                "timeout": self.timeout,
+                "stream": True,
+            }
+
+            if tools:
+                request_params["tools"] = tools
+                if openai_tool_choice is not None:
+                    request_params["tool_choice"] = openai_tool_choice
+
+            # Use streaming API
+            content_parts: list[str] = []
+            tool_calls_by_index: dict[int, dict[str, Any]] = {}
+            finish_reason = "stop"
+
+            stream = self.provider.chat.completions.create(**request_params)
+
+            for chunk in stream:
+                if not chunk or not chunk.choices:
+                    continue
+
+                choice = chunk.choices[0]
+                delta = choice.delta
+
+                # Check for finish reason
+                if choice.finish_reason:
+                    finish_reason = choice.finish_reason
+
+                # Extract content deltas
+                if delta.content:
+                    content_parts.append(delta.content)
+                    yield ("content_delta", delta.content)
+
+                # Extract tool calls
+                if delta.tool_calls:
+                    for tc in delta.tool_calls:
+                        idx = tc.index
+                        fn = tc.function
+
+                        if idx not in tool_calls_by_index:
+                            tool_calls_by_index[idx] = {
+                                "id": tc.id or "",
+                                "name": fn.name if fn and fn.name else "",
+                                "arguments": "",
+                            }
+
+                        t = tool_calls_by_index[idx]
+                        if tc.id:
+                            t["id"] = tc.id
+                        if fn and fn.name:
+                            t["name"] = fn.name
+                        if fn and fn.arguments:
+                            t["arguments"] += fn.arguments
+
+            # Build final message
+            full_content = "".join(content_parts).strip() or None
+
+            # Convert tool calls to list format
+            tool_calls_list = None
+            if tool_calls_by_index:
+                tool_calls_list = []
+                for tc in tool_calls_by_index.values():
+                    try:
+                        # Parse accumulated arguments as JSON
+                        parsed_args = json.loads(tc["arguments"])
+                    except Exception:
+                        parsed_args = tc["arguments"]
+
+                    tool_calls_list.append(
+                        {
+                            "id": tc["id"],
+                            "name": tc["name"],
+                            "arguments": parsed_args,
+                        }
+                    )
+                finish_reason = "tool_calls"
+
+            yield (
+                "message",
+                {
+                    "content": full_content,
+                    "tool_calls": tool_calls_list,
+                    "finish_reason": finish_reason,
+                },
+            )
+
+        except Exception as e:
+            logger.warning("Azure OpenAI streaming returned an error: %s", str(e))
+            yield (
+                "message",
+                {
+                    "content": None,
+                    "tool_calls": None,
+                    "finish_reason": "error",
+                },
+            )
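Review note: the index-keyed accumulation in the hunk above can be exercised in isolation. The following is a minimal sketch, not the Frigate implementation; `FunctionDelta`, `ToolCallDelta`, and `accumulate_tool_calls` are hypothetical stand-ins for the streamed chunk objects and are not part of any real client library. It shows how argument fragments that arrive across several chunks are joined per tool-call index and only parsed as JSON once the stream has ended.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FunctionDelta:
    """Hypothetical stand-in for the function part of a streamed tool call."""

    name: Optional[str] = None
    arguments: Optional[str] = None


@dataclass
class ToolCallDelta:
    """Hypothetical stand-in for one streamed tool-call fragment."""

    index: int
    id: Optional[str] = None
    function: Optional[FunctionDelta] = None


def accumulate_tool_calls(deltas: list[ToolCallDelta]) -> list[dict[str, Any]]:
    """Merge fragments by index, then parse the joined argument strings."""
    by_index: dict[int, dict[str, Any]] = {}
    for tc in deltas:
        entry = by_index.setdefault(tc.index, {"id": "", "name": "", "arguments": ""})
        if tc.id:
            entry["id"] = tc.id
        fn = tc.function
        if fn and fn.name:
            entry["name"] = fn.name
        if fn and fn.arguments:
            # Argument JSON is split across chunks, so concatenate the raw text.
            entry["arguments"] += fn.arguments

    result = []
    for entry in by_index.values():
        try:
            args: Any = json.loads(entry["arguments"])
        except json.JSONDecodeError:
            # Fall back to the raw string when the fragments are not valid JSON.
            args = entry["arguments"]
        result.append({"id": entry["id"], "name": entry["name"], "arguments": args})
    return result


deltas = [
    ToolCallDelta(
        index=0,
        id="call_1",
        function=FunctionDelta(name="search", arguments='{"label": '),
    ),
    ToolCallDelta(index=0, function=FunctionDelta(arguments='"person"}')),
]
calls = accumulate_tool_calls(deltas)
```

Parsing only after the loop matters because an individual fragment such as `{"label": ` is not valid JSON on its own.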
--- a/frigate/genai/gemini.py
+++ b/frigate/genai/gemini.py
@ -1,5 +1,6 @@
 """Gemini Provider for Frigate AI."""

+import json
 import logging
 from typing import Any, Optional

@ -273,3 +274,239 @@ class GeminiClient(GenAIClient):
                "tool_calls": None,
                "finish_reason": "error",
            }
+
+    async def chat_with_tools_stream(
+        self,
+        messages: list[dict[str, Any]],
+        tools: Optional[list[dict[str, Any]]] = None,
+        tool_choice: Optional[str] = "auto",
+    ):
+        """
+        Stream chat with tools; yields content deltas then final message.
+
+        Implements streaming function calling/tool usage for Gemini models.
+        """
+        try:
+            # Convert messages to Gemini format
+            gemini_messages = []
+            for msg in messages:
+                role = msg.get("role", "user")
+                content = msg.get("content", "")
+
+                # Map roles to Gemini format
+                if role == "system":
+                    # Gemini doesn't have system role, prepend to first user message
+                    if gemini_messages and gemini_messages[0].role == "user":
+                        gemini_messages[0].parts[
+                            0
+                        ].text = f"{content}\n\n{gemini_messages[0].parts[0].text}"
+                    else:
+                        gemini_messages.append(
+                            types.Content(
+                                role="user", parts=[types.Part.from_text(text=content)]
+                            )
+                        )
+                elif role == "assistant":
+                    gemini_messages.append(
+                        types.Content(
+                            role="model", parts=[types.Part.from_text(text=content)]
+                        )
+                    )
+                elif role == "tool":
+                    # Handle tool response
+                    function_response = {
+                        "name": msg.get("name", ""),
+                        "response": content,
+                    }
+                    gemini_messages.append(
+                        types.Content(
+                            role="function",
+                            parts=[
+                                types.Part.from_function_response(function_response)
+                            ],
+                        )
+                    )
+                else:  # user
+                    gemini_messages.append(
+                        types.Content(
+                            role="user", parts=[types.Part.from_text(text=content)]
+                        )
+                    )
+
+            # Convert tools to Gemini format
+            gemini_tools = None
+            if tools:
+                gemini_tools = []
+                for tool in tools:
+                    if tool.get("type") == "function":
+                        func = tool.get("function", {})
+                        gemini_tools.append(
+                            types.Tool(
+                                function_declarations=[
+                                    types.FunctionDeclaration(
+                                        name=func.get("name", ""),
+                                        description=func.get("description", ""),
+                                        parameters=func.get("parameters", {}),
+                                    )
+                                ]
+                            )
+                        )
+
+            # Configure tool choice
+            tool_config = None
+            if tool_choice:
+                if tool_choice == "none":
+                    tool_config = types.ToolConfig(
+                        function_calling_config=types.FunctionCallingConfig(mode="NONE")
+                    )
+                elif tool_choice == "auto":
+                    tool_config = types.ToolConfig(
+                        function_calling_config=types.FunctionCallingConfig(mode="AUTO")
+                    )
+                elif tool_choice == "required":
+                    tool_config = types.ToolConfig(
+                        function_calling_config=types.FunctionCallingConfig(mode="ANY")
+                    )
+
+            # Build request config
+            config_params = {"candidate_count": 1}
+
+            if gemini_tools:
+                config_params["tools"] = gemini_tools
+
+            if tool_config:
+                config_params["tool_config"] = tool_config
+
+            # Merge runtime_options
+            if isinstance(self.genai_config.runtime_options, dict):
+                config_params.update(self.genai_config.runtime_options)
+
+            # Use streaming API
+            content_parts: list[str] = []
+            tool_calls_by_index: dict[int, dict[str, Any]] = {}
+            finish_reason = "stop"
+
+            response = self.provider.models.generate_content_stream(
+                model=self.genai_config.model,
+                contents=gemini_messages,
+                config=types.GenerateContentConfig(**config_params),
+            )
+
+            async for chunk in response:
+                if not chunk or not chunk.candidates:
+                    continue
+
+                candidate = chunk.candidates[0]
+
+                # Check for finish reason
+                if hasattr(candidate, "finish_reason") and candidate.finish_reason:
+                    from google.genai.types import FinishReason
+
+                    if candidate.finish_reason == FinishReason.STOP:
+                        finish_reason = "stop"
+                    elif candidate.finish_reason == FinishReason.MAX_TOKENS:
+                        finish_reason = "length"
+                    elif candidate.finish_reason in [
+                        FinishReason.SAFETY,
+                        FinishReason.RECITATION,
+                    ]:
+                        finish_reason = "error"
+
+                # Extract content and tool calls from chunk
+                if candidate.content and candidate.content.parts:
+                    for part in candidate.content.parts:
+                        if part.text:
+                            content_parts.append(part.text)
+                            yield ("content_delta", part.text)
+                        elif part.function_call:
+                            # Handle function call
+                            try:
+                                arguments = (
+                                    dict(part.function_call.args)
+                                    if part.function_call.args
+                                    else {}
+                                )
+                            except Exception:
+                                arguments = {}
+
+                            # Store tool call
+                            tool_call_id = part.function_call.name or ""
+                            tool_call_name = part.function_call.name or ""
+
+                            # Check if we already have this tool call
+                            found_index = None
+                            for idx, tc in tool_calls_by_index.items():
+                                if tc["name"] == tool_call_name:
+                                    found_index = idx
+                                    break
+
+                            if found_index is None:
+                                found_index = len(tool_calls_by_index)
+                                tool_calls_by_index[found_index] = {
+                                    "id": tool_call_id,
+                                    "name": tool_call_name,
+                                    "arguments": "",
+                                }
+
+                            # Accumulate arguments
+                            if arguments:
+                                tool_calls_by_index[found_index]["arguments"] += (
+                                    json.dumps(arguments)
+                                    if isinstance(arguments, dict)
+                                    else str(arguments)
+                                )
+
+            # Build final message
+            full_content = "".join(content_parts).strip() or None
+
+            # Convert tool calls to list format
+            tool_calls_list = None
+            if tool_calls_by_index:
+                tool_calls_list = []
+                for tc in tool_calls_by_index.values():
+                    try:
+                        # Try to parse accumulated arguments as JSON
+                        parsed_args = json.loads(tc["arguments"])
+                    except Exception:
+                        parsed_args = tc["arguments"]
+
+                    tool_calls_list.append(
+                        {
+                            "id": tc["id"],
+                            "name": tc["name"],
+                            "arguments": parsed_args,
+                        }
+                    )
+                finish_reason = "tool_calls"
+
+            yield (
+                "message",
+                {
+                    "content": full_content,
+                    "tool_calls": tool_calls_list,
+                    "finish_reason": finish_reason,
+                },
+            )
+
+        except errors.APIError as e:
+            logger.warning("Gemini API error during streaming: %s", str(e))
+            yield (
+                "message",
+                {
+                    "content": None,
+                    "tool_calls": None,
+                    "finish_reason": "error",
+                },
+            )
+        except Exception as e:
+            logger.warning(
+                "Gemini returned an error during chat_with_tools_stream: %s", str(e)
+            )
+            yield (
+                "message",
+                {
+                    "content": None,
+                    "tool_calls": None,
+                    "finish_reason": "error",
+                },
+            )
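Review note: both streaming implementations in this commit yield `("content_delta", text)` tuples followed by a single `("message", {...})` tuple. A caller could consume that event protocol roughly as sketched below. This assumes only the tuple shapes visible in the diff; `fake_stream` and `consume` are hypothetical names used for illustration and do not exist in Frigate.

```python
import asyncio
from typing import Any, AsyncIterator


async def fake_stream() -> AsyncIterator[tuple[str, Any]]:
    """Hypothetical stand-in for a provider's chat_with_tools_stream()."""
    yield ("content_delta", "A person ")
    yield ("content_delta", "walked by.")
    yield (
        "message",
        {
            "content": "A person walked by.",
            "tool_calls": None,
            "finish_reason": "stop",
        },
    )


async def consume(stream: AsyncIterator[tuple[str, Any]]) -> tuple[str, dict[str, Any]]:
    """Collect incremental text and keep the final summary message."""
    shown: list[str] = []
    final: dict[str, Any] = {}
    async for kind, payload in stream:
        if kind == "content_delta":
            # A real caller would forward this fragment to the UI right away.
            shown.append(payload)
        elif kind == "message":
            # The last event carries the full content, tool calls, and finish reason.
            final = payload
    return "".join(shown), final


text, message = asyncio.run(consume(fake_stream()))
```

Because the error paths in the diff also yield a `"message"` event with `finish_reason` set to `"error"`, a consumer written this way receives a final message whether or not the provider call succeeded.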
--- a/frigate/genai/llama_cpp.py
+++ b/frigate/genai/llama_cpp.py
@@ -102,7 +102,7 @@ class LlamaCppClient(GenAIClient):

    def get_context_size(self) -> int:
        """Get the context window size for llama.cpp."""
-        return self.provider_options.get("context_size", 4096)
+        return int(self.provider_options.get("context_size", 4096))

    def _build_payload(
        self,
--- a/frigate/genai/manager.py
+++ b/frigate/genai/manager.py
@@ -21,13 +21,12 @@ class GenAIClientManager:
    """Manages GenAI provider clients from Frigate config."""

    def __init__(self, config: FrigateConfig) -> None:
-        self._config = config
        self._tool_client: Optional[GenAIClient] = None
        self._vision_client: Optional[GenAIClient] = None
        self._embeddings_client: Optional[GenAIClient] = None
-        self._update_config()
+        self.update_config(config)

-    def _update_config(self) -> None:
+    def update_config(self, config: FrigateConfig) -> None:
        """Build role clients from current Frigate config.genai.

        Called from __init__ and can be called again when config is reloaded.
@@ -40,12 +39,12 @@ class GenAIClientManager:
        self._vision_client = None
        self._embeddings_client = None

-        if not self._config.genai:
+        if not config.genai:
            return

        load_providers()

-        for _name, genai_cfg in self._config.genai.items():
+        for _name, genai_cfg in config.genai.items():
            if not genai_cfg.provider:
                continue
            provider_cls = PROVIDERS.get(genai_cfg.provider)
--- a/frigate/genai/ollama.py
+++ b/frigate/genai/ollama.py
@@ -85,8 +85,8 @@ class OllamaClient(GenAIClient):

    def get_context_size(self) -> int:
        """Get the context window size for Ollama."""
-        return self.genai_config.provider_options.get("options", {}).get(
-            "num_ctx", 4096
+        return int(
+            self.genai_config.provider_options.get("options", {}).get("num_ctx", 4096)
        )

    def _build_request_params(
--- a/frigate/genai/openai.py
+++ b/frigate/genai/openai.py
@@ -30,6 +30,10 @@ class OpenAIClient(GenAIClient):
            for k, v in self.genai_config.provider_options.items()
            if k != "context_size"
        }
+
+        if self.genai_config.base_url:
+            provider_opts["base_url"] = self.genai_config.base_url
+
        return OpenAI(api_key=self.genai_config.api_key, **provider_opts)

    def _send(self, prompt: str, images: list[bytes]) -> Optional[str]:
@@ -227,3 +231,142 @@ class OpenAIClient(GenAIClient):
                "tool_calls": None,
                "finish_reason": "error",
            }
+
+    async def chat_with_tools_stream(
+        self,
+        messages: list[dict[str, Any]],
+        tools: Optional[list[dict[str, Any]]] = None,
+        tool_choice: Optional[str] = "auto",
+    ):
+        """
+        Stream chat with tools; yields content deltas then final message.
+
+        Implements streaming function calling/tool usage for OpenAI models.
+        """
+        try:
+            openai_tool_choice = None
+            if tool_choice:
+                if tool_choice == "none":
+                    openai_tool_choice = "none"
+                elif tool_choice == "auto":
+                    openai_tool_choice = "auto"
+                elif tool_choice == "required":
+                    openai_tool_choice = "required"
+
+            request_params = {
+                "model": self.genai_config.model,
+                "messages": messages,
+                "timeout": self.timeout,
+                "stream": True,
+            }
+
+            if tools:
+                request_params["tools"] = tools
+                if openai_tool_choice is not None:
+                    request_params["tool_choice"] = openai_tool_choice
+
+            if isinstance(self.genai_config.provider_options, dict):
+                excluded_options = {"context_size"}
+                provider_opts = {
+                    k: v
+                    for k, v in self.genai_config.provider_options.items()
+                    if k not in excluded_options
+                }
+                request_params.update(provider_opts)
+
+            # Use streaming API
+            content_parts: list[str] = []
+            tool_calls_by_index: dict[int, dict[str, Any]] = {}
+            finish_reason = "stop"
+
+            stream = self.provider.chat.completions.create(**request_params)
+
+            for chunk in stream:
+                if not chunk or not chunk.choices:
+                    continue
+
+                choice = chunk.choices[0]
+                delta = choice.delta
+
+                # Check for finish reason
+                if choice.finish_reason:
+                    finish_reason = choice.finish_reason
+
+                # Extract content deltas
+                if delta.content:
+                    content_parts.append(delta.content)
+                    yield ("content_delta", delta.content)
+
+                # Extract tool calls
+                if delta.tool_calls:
+                    for tc in delta.tool_calls:
+                        idx = tc.index
+                        fn = tc.function
+
+                        if idx not in tool_calls_by_index:
+                            tool_calls_by_index[idx] = {
+                                "id": tc.id or "",
+                                "name": fn.name if fn and fn.name else "",
+                                "arguments": "",
+                            }
+
+                        t = tool_calls_by_index[idx]
+                        if tc.id:
+                            t["id"] = tc.id
+                        if fn and fn.name:
+                            t["name"] = fn.name
+                        if fn and fn.arguments:
+                            t["arguments"] += fn.arguments
+
+            # Build final message
+            full_content = "".join(content_parts).strip() or None
+
+            # Convert tool calls to list format
+            tool_calls_list = None
+            if tool_calls_by_index:
+                tool_calls_list = []
+                for tc in tool_calls_by_index.values():
+                    try:
+                        # Parse accumulated arguments as JSON
+                        parsed_args = json.loads(tc["arguments"])
+                    except Exception:  # includes json.JSONDecodeError
+                        parsed_args = tc["arguments"]
+
+                    tool_calls_list.append(
+                        {
+                            "id": tc["id"],
+                            "name": tc["name"],
+                            "arguments": parsed_args,
+                        }
+                    )
+                finish_reason = "tool_calls"
+
+            yield (
+                "message",
+                {
+                    "content": full_content,
+                    "tool_calls": tool_calls_list,
+                    "finish_reason": finish_reason,
+                },
+            )
+
+        except TimeoutException as e:
+            logger.warning("OpenAI streaming request timed out: %s", str(e))
+            yield (
+                "message",
+                {
+                    "content": None,
+                    "tool_calls": None,
+                    "finish_reason": "error",
+                },
+            )
+        except Exception as e:
+            logger.warning("OpenAI streaming returned an error: %s", str(e))
+            yield (
+                "message",
+                {
+                    "content": None,
+                    "tool_calls": None,
+                    "finish_reason": "error",
+                },
+            )
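
Reviewer note: the streaming methods added in this commit yield `("content_delta", str)` tuples followed by a single `("message", dict)` tuple, including on error paths. The sketch below shows one way a caller might consume that event protocol. It is not taken from the Frigate codebase: `fake_stream` and `consume` are hypothetical names, and `fake_stream` only imitates the event shapes visible in the diff so the example can run without a provider.

```python
import asyncio
from typing import Any, AsyncIterator


async def fake_stream() -> AsyncIterator[tuple[str, Any]]:
    """Hypothetical stand-in that yields the same event shapes as the diff."""
    yield ("content_delta", "Hello, ")
    yield ("content_delta", "world")
    yield (
        "message",
        {"content": "Hello, world", "tool_calls": None, "finish_reason": "stop"},
    )


async def consume(
    stream: AsyncIterator[tuple[str, Any]],
) -> tuple[str, dict[str, Any]]:
    """Collect content deltas and return them with the final message."""
    deltas: list[str] = []
    # Default mirrors the error payload used in the diff, in case the
    # stream ends without ever emitting a "message" event.
    final: dict[str, Any] = {
        "content": None,
        "tool_calls": None,
        "finish_reason": "error",
    }
    async for kind, payload in stream:
        if kind == "content_delta":
            deltas.append(payload)  # incremental text, e.g. for live display
        elif kind == "message":
            final = payload  # complete message with any parsed tool calls
    return "".join(deltas), final


text, message = asyncio.run(consume(fake_stream()))
```

A caller would typically check `message["finish_reason"]` for `"error"` or `"tool_calls"` before using `text`, since the error branches in the diff yield a final message with `content` set to `None`.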