Bug: BindingHelper throws InvalidOperationException instead of a retryable exception when local RPC address is temporarily unavailable
Context
Over the past week, our Azure Functions Isolated Worker apps have been experiencing frequent Grpc.Core.RpcException failures (StatusCode="Internal", HTTP_1_1_REQUIRED) on DurableTaskClient calls such as GetInstanceAsync, ScheduleNewOrchestrationInstanceAsync, and RaiseEventAsync. These are transient gRPC sidecar issues, so we implemented an application-level retry strategy using Polly to wrap all DurableTaskClient calls.
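For context, the application-level retry mentioned above looks roughly like the following. This is a minimal sketch assuming Polly v8's `ResiliencePipeline` API; the status codes, delays, and attempt counts are illustrative, not our exact production values.

```csharp
using Grpc.Core;
using Polly;
using Polly.Retry;

// Retry DurableTaskClient calls that fail with a transient gRPC status.
// StatusCode.Internal covers the HTTP_1_1_REQUIRED failures we observe;
// StatusCode.Unavailable covers sidecar restarts.
ResiliencePipeline pipeline = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions
    {
        ShouldHandle = new PredicateBuilder().Handle<RpcException>(ex =>
            ex.StatusCode is StatusCode.Internal or StatusCode.Unavailable),
        MaxRetryAttempts = 5,
        Delay = TimeSpan.FromSeconds(2),
        BackoffType = DelayBackoffType.Exponential,
    })
    .Build();

// Usage: wrap each DurableTaskClient call, e.g.
// var metadata = await pipeline.ExecuteAsync(
//     async ct => await client.GetInstanceAsync(instanceId, ct),
//     cancellationToken);
```

This works for `RpcException` thrown inside our function code, but, as described below, it cannot reach the `InvalidOperationException` thrown during host-side binding.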
While investigating, we discovered a second failure mode — System.InvalidOperationException: "The local RPC address has not been configured!" — which appears to be caused by the same underlying network/sidecar instability. However, this exception cannot be retried at the application level because it occurs in the host binding pipeline before our function code executes.
Summary
When the Durable Task gRPC sidecar's local RPC address is temporarily unavailable, BindingHelper.DurableOrchestrationClientToString throws InvalidOperationException. This is a non-retryable exception for what is a transient infrastructure condition. We expect the library to handle this transient issue in a more graceful and reliable way rather than immediately failing with an unrecoverable exception.
The issue is transient in nature but can persist for extended periods (minutes to hours), during which every function invocation using [DurableClient] binding fails immediately. For queue-triggered functions, this causes messages to rapidly exhaust their dequeue count and move to poison queues, resulting in data loss.
Environment
| Component | Version |
|---|---|
| Azure Functions Host | 4.1047.100.26071 |
| Microsoft.Azure.WebJobs.Extensions.DurableTask | 3.0.0.0 |
| Microsoft.Azure.Functions.Worker | 2.51.0 |
| Microsoft.Azure.Functions.Worker.Sdk | 2.0.0 |
| Microsoft.Azure.Functions.Worker.Extensions.DurableTask | 1.14.1 |
| .NET | 8.0 |
| OS | Windows (Azure App Service) |
| Plan | Premium v3 |
Reproduction
This issue occurs in production under load. We have not identified a reliable minimal reproduction, but the pattern is consistent:
- Function app is running normally, processing queue messages
- At some point, the Durable Task sidecar's local RPC address becomes temporarily unavailable
- All subsequent function invocations that use `[DurableClient]` binding fail immediately with `InvalidOperationException`
- The condition is transient but can last from minutes to several hours
- During this period, no function using `[DurableClient]` can execute
Exception Details
Inner exception (root cause):
```
System.InvalidOperationException: The local RPC address has not been configured!
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.BindingHelper.DurableOrchestrationClientToString(DurableClientAttribute, DurableOrchestrationClientToString) in BindingHelper.cs:line 37
```
Outer exception:
```
Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Exception while executing function: Functions.InputQueueTrigger
```
Full call chain:
```
WorkerFunctionInvoker.InvokeCore
 → WorkerFunctionInvoker.BindInputsAsync
 → ExtensionBinding.BindAsync
 → FunctionBinding.BindStringAsync
 → Binder.BindAsync
 → BindToInputBindingProvider.BuildAsync
 → PatternMatcher.New
 → BindingHelper.DurableOrchestrationClientToString   ← FAILS HERE
```
The failure occurs during the host-side input binding phase, before the isolated worker process receives the invocation. The host attempts to serialize the [DurableClient] binding info (including the local gRPC sidecar RPC address) to pass to the worker, but the address has not been configured.
The Problem
The core issue is that BindingHelper.DurableOrchestrationClientToString treats a transient sidecar availability problem as a fatal error by throwing InvalidOperationException:
- **Wrong exception type:** `InvalidOperationException` is not recognized as transient by the Azure Functions host. The host treats it as a non-retryable application error, which is incorrect for an infrastructure availability issue.
- **No wait/retry:** The method fails immediately (~5-7 ms) without attempting to wait for the sidecar to become ready or retrying the RPC address lookup.
- **Affects all functions:** During the issue window, every function using `[DurableClient]` binding fails: timer triggers, queue triggers, and activity functions are all impacted.
- **Not interceptable by application code:** The failure occurs in the host binding pipeline before user code executes, so no application-level retry (e.g., Polly) can work around it.
- **Poison queue data loss:** For queue-triggered functions, the rapid failures cause messages to exhaust their dequeue count (default: 5) and move to poison queues. This results in lost work items that require manual reprocessing.
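The only mitigation available to us for the poison-queue problem is raising the queue trigger's dequeue allowance in host.json, which delays but does not prevent message loss during a multi-hour outage. A sketch, with illustrative values:

```json
{
  "version": "2.0",
  "extensions": {
    "queues": {
      "maxDequeueCount": 20,
      "visibilityTimeout": "00:01:00"
    }
  }
}
```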
Expected Behavior
- `BindingHelper.DurableOrchestrationClientToString` should wait for the sidecar to become ready (with a reasonable timeout) rather than failing immediately when the RPC address is not yet available.
- If waiting is not feasible, the method should throw a retryable exception type (e.g., a custom transient exception or `RpcException` with `StatusCode.Unavailable`) so the host and built-in retry mechanisms can handle it appropriately.
- Throwing `InvalidOperationException` is incorrect because it signals a programming error rather than a transient infrastructure condition, preventing any retry-based recovery.
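To make the second option concrete, here is a sketch of the kind of change we have in mind. The method name, parameter, and surrounding shape are hypothetical; we only know `BindingHelper` from the stack trace, not its actual signature.

```csharp
using Grpc.Core;

// Hypothetical helper illustrating the requested behavior: when the
// sidecar's RPC address is missing, surface a transient, retryable
// failure instead of InvalidOperationException.
static string ResolveRpcAddressOrThrowRetryable(string? localRpcAddress)
{
    if (string.IsNullOrEmpty(localRpcAddress))
    {
        // StatusCode.Unavailable signals a transient infrastructure
        // condition, so host-level retry policies can engage.
        throw new RpcException(new Status(
            StatusCode.Unavailable,
            "The local RPC address has not been configured yet."));
    }

    return localRpcAddress;
}
```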
Related
This issue may be related to gRPC sidecar instability we've also reported — Grpc.Core.RpcException with StatusCode="Internal" and HTTP_1_1_REQUIRED / socket exhaustion errors occurring on DurableTaskClient calls. Both issues point to instability in the Durable Task gRPC sidecar lifecycle management.