Commit 5b821c1

committed
first draft
1 parent 1fe2b9f commit 5b821c1

2 files changed

+319
-0
lines changed

geps/gep-3440/index.md

Lines changed: 281 additions & 0 deletions
@@ -0,0 +1,281 @@
1+
# GEP-3440: Gateway API Support for gRPC Retries
2+
3+
* Issue: [#3440](https://github.com/kubernetes-sigs/gateway-api/issues/3440)
4+
* Status: Provisional
5+
6+
## TLDR
7+
This proposal introduces support for gRPC retries in the Gateway API,
8+
allowing for configuration of retry attempts, backoff duration, and retryable status codes for gRPC routes.
9+
10+
## Goals
11+
12+
- To allow specification of gRPC status codes that should be retried.
13+
- To allow specification of the maximum number of times to retry a gRPC request.
14+
- To allow specification of the minimum backoff interval between retry attempts for gRPC requests.
15+
- Retry configuration must be applicable to most known Gateway API implementations for gRPC.
16+
- To define any interaction with configured gRPC timeouts and backoff.
17+
18+
## Non-Goals
19+
20+
- No standard APIs for advanced retry logic, such as integrating with rate-limiting headers.
21+
- No default retry policies for all routes within a namespace or for routes tied to a specific Gateway.
22+
- No support for detailed backoff adjustments, like fine-tuning intervals, adding jitter, or setting max duration caps.
23+
- No retry support for streaming or bidirectional APIs (may be considered in future proposals).
24+
25+
## Introduction
26+
27+
To keep services reliable and resilient, a Gateway API implementation should be able to retry failed gRPC requests to
28+
backend services before giving up and returning an error to clients.
29+
30+
Retries are helpful for several key reasons:
31+
1. **Network failures**: Network issues can often cause temporary errors. Retrying a request helps to mitigate these
32+
intermittent problems.
33+
2. **Server-side failures**: Servers may fail temporarily due to overload or other issues.
34+
Retrying allows requests to succeed once these conditions are resolved.
35+
3. **Recovery from temporary errors**: Certain errors, such as `unavailable` or `resource-exhausted`, are often short-lived.
36+
Retrying can allow the request to complete once these issues clear up.
37+
38+
This proposal aims to establish a streamlined, consistent API for retrying gRPC requests, covering essential
39+
functionality in a way that is broadly applicable across implementations.
40+
41+
## Background on implementations
42+
43+
This section surveys how different Gateway API implementations handle retries for gRPC requests.
44+
45+
### Envoy
46+
Envoy supports retries for gRPC requests using the `retry_policy` field in the `route` configuration of the HTTP filter.
47+
The policy's `retry_on` field, or the `x-envoy-retry-grpc-on` request header, lists the gRPC status codes that should trigger a retry.
48+
The following gRPC conditions are supported:
49+
- `cancelled`: Envoy will attempt a retry if the gRPC status code in the response headers is “cancelled”.
50+
- `deadline-exceeded`: Envoy will attempt a retry if the gRPC status code in the response headers is “deadline-exceeded”.
51+
- `internal`: Envoy will attempt a retry if the gRPC status code in the response headers is “internal”.
52+
- `resource-exhausted`: Envoy will attempt a retry if the gRPC status code in the response headers is “resource-exhausted”.
53+
- `unavailable`: Envoy will attempt a retry if the gRPC status code in the response headers is “unavailable”.
54+
55+
As with the `x-envoy-retry-grpc-on` header, the number of retries can be controlled via the `x-envoy-max-retries` header.
56+
57+
By default, Envoy uses a fully jittered exponential backoff algorithm for retries.
58+
This means that after a failed attempt, Envoy waits a random amount of time (with jitter) based on
59+
an exponential growth pattern before trying again.
60+
- **Default Timing**: The base interval starts at 25ms, and each subsequent retry can increase
61+
this interval exponentially. By default, the maximum interval is capped at 250ms (10 times the base interval).
62+
- **Per-Attempt Timeout (`per_try_timeout`)**: Envoy allows you to set a specific timeout for each retry attempt,
63+
known as `per_try_timeout`. This timeout applies separately to the initial request and to each retry attempt.
64+
If you don’t specify a `per_try_timeout`, Envoy uses the global route timeout for the total duration of the request.
65+
66+
In the Gateway API, this `per_try_timeout` will be equivalent to the BackendRequest timeout in the GRPCRouteRule.
67+
This ensures that each retry attempt, including the initial one, respects the overall timeout defined for the backend
68+
request, preventing retries from extending beyond the desired duration.
69+
70+
### Nginx
71+
`ngx_http_grpc_module` in Nginx supports retries for gRPC requests using the `grpc_pass` directive.
72+
73+
For gRPC requests, Nginx allows retries under certain conditions by forwarding requests to another server in
74+
an upstream pool when the initial request fails.
75+
The following configuration options are available to control when and how retries occur:
76+
1. **Retry Conditions** (`grpc_next_upstream`):
77+
Nginx can retry a request if certain issues are encountered, such as:
78+
- Network errors (e.g., connection or read errors).
79+
- Timeouts when establishing a connection or reading a response.
80+
- Invalid headers if the server sends an empty or malformed response.
81+
- Specific HTTP error codes (e.g., 500, 502, 503, 504, 429) can be configured as retryable for gRPC responses.
82+
By default, Nginx only retries on network errors and timeouts,
83+
but you can specify other conditions (like HTTP status codes) to expand retry options.
84+
2. **Retry Limit by Time** (`grpc_next_upstream_timeout`):
85+
You can set a total time limit for how long Nginx will attempt retries.
86+
This limits the retry process to a specified time window, after which Nginx will stop attempting further retries.
87+
3. **Retry Limit by Number** (`grpc_next_upstream_tries`):
88+
You can set a maximum number of retry attempts for a request.
89+
Once this limit is reached, Nginx will stop attempting further retries.
90+
4. **Non-Idempotent Requests** (`non_idempotent`):
91+
By default, Nginx does not retry requests with a non-idempotent method (POST, LOCK, PATCH) because they can cause side effects
92+
if sent multiple times. However, you can enable retries for non-idempotent requests if needed.
93+
94+
**Important Considerations**:
95+
- **Partial Responses**: Nginx can only retry if no part of the response has been sent to the client.
96+
If an error occurs mid-response, retries are not possible.
97+
- **Unsuccessful Attempts**: Errors like `timeout` and `invalid_header` are always considered unsuccessful and will
98+
trigger retries if specified, while errors like `403` and `404` are not retryable by default.
99+
100+
### HAProxy
101+
1. **Retry Conditions**: HAProxy can retry requests based on various network conditions
102+
(e.g., connection failures, timeouts) and some HTTP error codes. While HAProxy does support gRPC via HTTP/2, it does not
103+
have built-in support for handling specific gRPC error codes (like `Cancelled`, `Deadline Exceeded`).
104+
It relies on HTTP-level conditions for retries, so its gRPC support is less granular than the GEP requires.
105+
2. **Retry Limits**: HAProxy allows you to set a maximum number of retries for a request using the `retries` directive.
106+
Each attempt is bounded by the `timeout connect` (connection establishment) and `timeout server` (server inactivity) directives; neither caps the retry process as a whole.
107+
108+
### Traefik
109+
1. **Retry Conditions**: Traefik allows for retries based on HTTP-level conditions (e.g., connection errors and
110+
certain HTTP status codes like 500, 502, 503, and 504), but it does not natively interpret specific gRPC error codes
111+
like `UNAVAILABLE` or `DEADLINE_EXCEEDED`. This means that, while Traefik can retry requests on common HTTP errors
112+
that might represent temporary issues, it lacks the ability to directly handle and retry based on
113+
gRPC-specific error codes, limiting its alignment with the GEP’s requirement for granular gRPC error handling.
114+
2. **Retry Limits**: Traefik provides configurable retry attempts and can set a maximum number of retries. However,
115+
Traefik does not offer per-try timeout controls specific to each retry attempt. Instead, it typically relies on a
116+
global request timeout, limiting the flexibility needed for more precise gRPC retry management (like Envoy’s `per_try_timeout`).
117+
118+
## API
119+
Having a dedicated API for gRPC retry conditions is necessary because gRPC uses
120+
unique error codes (e.g., `UNAVAILABLE`, `DEADLINE_EXCEEDED`) that represent transient issues specific to its protocol,
121+
which are not adequately covered by general HTTP status codes. gRPC also supports streaming and real-time communications,
122+
making retry strategies more complex than those used for standard HTTP requests. Existing proxies like Envoy handle
123+
gRPC retries with specialized logic, while other proxies rely on HTTP error codes, lacking the precision needed
124+
for gRPC.
125+
126+
### Go
127+
128+
```go
type GRPCRouteRule struct {
	// Retry defines the configuration for when to retry a gRPC request.
	//
	// Support: Extended
	//
	// +optional
	// <gateway:experimental>
	Retry *GRPCRouteRetry `json:"retry,omitempty"`

	// ...
}

// GRPCRouteRetry defines retry configuration for a GRPCRoute.
//
// Implementations SHOULD retry on common transient gRPC errors
// if a retry configuration is specified.
//
type GRPCRouteRetry struct {
	// Reasons defines the gRPC error conditions for which a backend request
	// should be retried.
	//
	// Supported gRPC error conditions:
	// * "cancelled"
	// * "deadline-exceeded"
	// * "internal"
	// * "resource-exhausted"
	// * "unavailable"
	//
	// Implementations MUST support retrying requests for these conditions
	// when specified.
	//
	// Support: Extended
	//
	// +optional
	// <gateway:experimental>
	Reasons []GRPCRouteRetryCondition `json:"reasons,omitempty"`

	// Attempts specifies the maximum number of times an individual request
	// from the gateway to a backend should be retried.
	//
	// If the maximum number of retries has been attempted without a successful
	// response from the backend, the Gateway MUST return an error.
	//
	// When this field is unspecified, the number of times to attempt to retry
	// a backend request is implementation-specific.
	//
	// Support: Extended
	//
	// +optional
	Attempts *int `json:"attempts,omitempty"`

	// Backoff specifies the minimum duration a Gateway should wait between
	// retry attempts, represented in Gateway API Duration formatting.
	//
	// For example, setting the `rules[].retry.backoff` field to `100ms`
	// will cause a backend request to be retried approximately 100 milliseconds
	// after timing out or receiving a specified retryable condition.
	//
	// Implementations MAY use an exponential or alternative backoff strategy,
	// MAY cap the maximum backoff duration, and MAY add jitter to stagger requests,
	// as long as unsuccessful backend requests are not retried before the configured
	// minimum duration.
	//
	// If a Request timeout (`rules[].timeouts.request`) is configured, the entire
	// duration of the initial request and any retry attempts MUST NOT exceed the
	// Request timeout. Ongoing retry attempts should be cancelled if this duration
	// is reached, and the Gateway MUST return a timeout error.
	//
	// Support: Extended
	//
	// +optional
	Backoff *Duration `json:"backoff,omitempty"`
}

// GRPCRouteRetryCondition defines a gRPC error condition for which a backend
// request should be retried.
//
// The following conditions are considered retryable:
//
// * "cancelled"
// * "deadline-exceeded"
// * "internal"
// * "resource-exhausted"
// * "unavailable"
//
// Implementations MAY support additional gRPC error codes if applicable.
//
// +kubebuilder:validation:Enum=cancelled;deadline-exceeded;internal;resource-exhausted;unavailable
type GRPCRouteRetryCondition string

// Duration is a string value representing a duration in time.
// Format follows GEP-2257, which is a subset of Golang's time.ParseDuration syntax.
//
// +kubebuilder:validation:Pattern=`^([0-9]{1,5}(h|m|s|ms)){1,4}$`
type Duration string
224+
```
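
As a rough illustration of what the validation markers above accept and reject, the following sketch applies the same enum and duration pattern in plain Go. The helper `validateRetry` is hypothetical and not part of the proposed API, and rejecting negative `attempts` is an assumption of this sketch, since the type definition above does not state a minimum.

```go
package main

import (
	"fmt"
	"regexp"
)

// durationPattern is the kubebuilder validation pattern from the Duration type.
var durationPattern = regexp.MustCompile(`^([0-9]{1,5}(h|m|s|ms)){1,4}$`)

// retryConditions mirrors the Enum marker on GRPCRouteRetryCondition.
var retryConditions = map[string]bool{
	"cancelled":          true,
	"deadline-exceeded":  true,
	"internal":           true,
	"resource-exhausted": true,
	"unavailable":        true,
}

// validateRetry is a hypothetical helper that checks a retry stanza the way
// the markers above would. An empty backoff means the field is unset.
func validateRetry(reasons []string, attempts int, backoff string) error {
	for _, r := range reasons {
		if !retryConditions[r] {
			return fmt.Errorf("unsupported retry reason %q", r)
		}
	}
	if attempts < 0 { // assumption: negative retry counts are rejected
		return fmt.Errorf("attempts must not be negative, got %d", attempts)
	}
	if backoff != "" && !durationPattern.MatchString(backoff) {
		return fmt.Errorf("invalid backoff duration %q", backoff)
	}
	return nil
}

func main() {
	fmt.Println(validateRetry([]string{"unavailable", "internal"}, 3, "100ms"))
	fmt.Println(validateRetry([]string{"not-found"}, 3, "100ms"))
	fmt.Println(validateRetry([]string{"unavailable"}, 3, "1.5s"))
}
```

The pattern only admits whole-number components with the units `h`, `m`, `s`, and `ms`, so values such as `1.5s` or `100us` are rejected even though Go's `time.ParseDuration` would accept them.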
225+
226+
### YAML
227+
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GRPCRoute
metadata:
  name: foo-route
spec:
  parentRefs:
  - name: example-gateway
  hostnames:
  - "foo.example.com"
  rules:
  - matches:
    - method:
        service: com.example
        method: Login
    retry:
      reasons:
      - cancelled
      - deadline-exceeded
      - internal
      - resource-exhausted
      - unavailable
      attempts: 3
      backoff: 100ms
    backendRefs:
    - name: foo-svc
      port: 50051
254+
```
255+
256+
## Conformance Details
257+
To ensure correct gRPC retry functionality, the following tests must be implemented across Gateway API implementations:
258+
1. `SupportGRPCRouteRetryBackendTimeout`
259+
- **Test**: Verify retries respect the BackendRequestTimeout. Requests should fail if the timeout is reached, even with retries.
260+
- **Expected**: Retries occur within the configured timeout, and fail if exceeded.
261+
2. `SupportGRPCRouteRetry`
262+
- **Test**: Ensure retries are triggered for retryable gRPC errors (cancelled, deadline-exceeded, internal, resource-exhausted, unavailable).
263+
- **Expected**: Retries for retryable errors; no retries for non-retryable errors.
264+
3. `SupportGRPCRouteRetryBackoff`
265+
- **Test**: Confirm retries use the configured backoff strategy.
266+
- **Expected**: Retries happen with increasing delay as per backoff configuration.
267+
268+
## Alternatives
269+
270+
### GRPCRoute filter
271+
An alternative approach could be to introduce a new filter for GRPCRoute that handles retries. However, as we have already
272+
established a `retry` field in the HTTPRouteRule, it makes sense to extend this to GRPCRoute for consistency.
273+
274+
## References
275+
276+
- [gRPC Retry Design](https://grpc.io/blog/guides/retry/)
277+
- [gRPC Status Codes](https://grpc.io/docs/guides/error/)
278+
- [Envoy Retry Policy](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto#envoy-v3-api-msg-config-route-v3-retry-policy)
279+
- [Nginx gRPC Module](https://nginx.org/en/docs/http/ngx_http_grpc_module.html)
280+
- [HAProxy Retries](https://cbonte.github.io/haproxy-dconv/2.4/configuration.html#4.2-retries)

geps/gep-3440/metadata.yaml

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
1+
apiVersion: internal.gateway.networking.k8s.io/v1alpha1
2+
kind: GEPDetails
3+
number: 3440
4+
name: GRPC Retries
5+
status: Provisional
6+
# Any authors who contribute to the GEP in any way should be listed here using
7+
# their Github handle.
8+
authors:
9+
- shadialtarsha
10+
relationships:
11+
# obsoletes indicates that a GEP makes the linked GEP obsolete, and completely
12+
# replaces that GEP. The obsoleted GEP MUST have its obsoletedBy field
13+
# set back to this GEP, and MUST be moved to Declined.
14+
obsoletes: {}
15+
obsoletedBy: {}
16+
# extends indicates that a GEP extends the linked GEP, adding more detail
17+
# or additional implementation. The extended GEP MUST have its extendedBy
18+
# field set back to this GEP.
19+
extends: {}
20+
extendedBy: {}
21+
# seeAlso indicates other GEPs that are relevant in some way without being
22+
# covered by an existing relationship.
23+
seeAlso: {}
24+
# references is a list of hyperlinks to relevant external references.
25+
# It's intended to be used for storing Github discussions, Google docs, etc.
26+
references:
27+
- https://grpc.io/docs/guides/retry/
28+
- https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto#envoy-v3-api-msg-config-route-v3-retrypolicy
29+
- https://grpc.github.io/grpc/core/md_doc_grpc_xds_features.html
30+
# featureNames is a list of the feature names introduced by the GEP, if there
31+
# are any. This will allow us to track which feature was introduced by which GEP.
32+
featureNames:
33+
- SupportGRPCRouteRetryBackendTimeout
34+
- SupportGRPCRouteRetry
35+
- SupportGRPCRouteRetryBackoff
36+
# changelog is a list of hyperlinks to PRs that make changes to the GEP, in
37+
# ascending date order.
38+
changelog: {}
