Commit 1639e67

Add a design doc for resource validation (#343)
1 parent 08fe11c commit 1639e67

1 file changed

+218
-0
lines changed

design/resource-validation.md

@@ -0,0 +1,218 @@
# Resource Validation

NGINX Kubernetes Gateway (NKG) must validate Gateway API resources for reliability, security, and conformity with the
Gateway API specification.

## Background

### Why Validate?

NKG transforms the Gateway API resources into NGINX configuration. Before transforming a resource, NKG needs to ensure
its validity, which is important for the following reasons:

1. *To prevent an invalid value from propagating into NGINX configuration*. For example, the URI in a path-based routing
   rule. Such propagation has the following consequences:
    1. Invalid input will make NGINX fail to reload. Moreover, until the corresponding invalid config is removed from
       NGINX configuration, NKG will not be able to reload NGINX for any future configuration changes. This affects the
       reliability of NKG.
    2. Malicious input can breach the security of NKG. For example, if a malicious user can insert raw NGINX config
       (something similar to an SQL injection), they can configure NGINX to serve the files on the container filesystem.
       This affects the security of NKG.
2. *To conform to the Gateway API spec*. For example, if an HTTPRoute configures an unsupported filter, an
   implementation like NKG needs to "set Accepted Condition for the Route to `status: False`, with a Reason
   of `UnsupportedValue`".
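
To make the injection risk from reason (1) concrete, here is a minimal sketch in Go. It assumes a deliberately naive generator that interpolates a route path into a `location` block without any validation. The function `renderLocation` and the `backend` upstream name are hypothetical and are not part of NKG.

```go
package main

import "fmt"

// renderLocation is a hypothetical, deliberately naive generator: it copies
// the user-supplied path into an NGINX location block without validating it.
func renderLocation(path string) string {
	return fmt.Sprintf("location %s {\n    proxy_pass http://backend;\n}\n", path)
}
```

Calling `renderLocation("/ { root /; } location /x")` produces a syntactically complete `location / { root /; }` block followed by a second location, which is the kind of raw config insertion described above.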

### Validation by the Gateway API Project

To help the implementations with the validation, the Gateway API already includes:

* *The OpenAPI schema with validation rules in the Gateway API CRDs*. It enforces the structure (for example, the field
  X must be a string) and the contents of the fields (for example, field Y only allows values 'a' and 'b').
  Additionally, it enforces limits such as max lengths on field values. Note:
  the Kubernetes API server enforces this validation. To bypass it, a user needs to change the CRDs.
* *The webhook validation*. This validation is written in Go and runs as part of the webhook, which is included in the
  Gateway API installation files. The validation covers additional logic that is not possible to implement in the CRDs.
  It does not repeat the validation from the CRDs. Note: a user can bypass this validation if the webhook is not
  installed.

However, the built-in validation rules do not cover all validation needs of NKG:

* The rules are not enough for NGINX. For example, the validation rule for the
  `value` of the path in a path-based routing rule allows symbols like `;`, `{`
  and `}`, which can break NGINX configuration for the
  corresponding [location](https://nginx.org/en/docs/http/ngx_http_core_module.html#location) block.
* The rules don't cover unsupported field cases. For example, the webhook does not know which filters are implemented by
  NKG, thus it cannot generate an appropriate error for NKG.
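
As an illustration of the first gap, the sketch below shows what an extra, NGINX-oriented check on a path `value` might look like in Go. The function name `validatePathValue` and the exact allow-list are assumptions made for this example; the actual rule set would need to be derived from NGINX's configuration syntax.

```go
package main

import (
	"fmt"
	"regexp"
)

// pathRegexp is an assumed allow-list for this sketch: the path must start
// with '/' and must not contain whitespace, '{', '}' or ';'.
var pathRegexp = regexp.MustCompile(`^/[^\s{};]*$`)

// validatePathValue is a hypothetical helper that rejects path values
// which could break or alter an NGINX location block.
func validatePathValue(path string) error {
	if !pathRegexp.MatchString(path) {
		return fmt.Errorf("invalid path %q: must start with '/' and must not contain whitespace, '{', '}' or ';'", path)
	}
	return nil
}
```

With this rule, `/coffee` passes, while `/tea;` and `/a{b}` are rejected even though the CRD validation would accept them.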

Additionally, as mentioned in [GEP-922](https://gateway-api.sigs.k8s.io/geps/gep-922/#implementers),
"implementers must not rely on webhook or CRD validation as a security mechanism. If field values need to be escaped to
secure an implementation, both webhook and CRD validation can be bypassed and cannot be relied on."

## Requirements

Design a validation mechanism for Gateway API resources.

### Personas

* *Cluster admin* who installs Gateway API (the CRDs and Webhook), installs NKG, and creates Gateway and GatewayClass
  resources.
* *Application developer* who creates HTTPRoutes and other routes.

### User Stories

1. As a cluster admin, I'd like to share NKG among multiple application developers, specifically in a way that invalid
   resources of one developer do not affect the resources of the other developers.
2. As a cluster admin/application developer, I expect that NKG rejects any invalid resources I create and that I am
   able to see the reasons (errors) for that.

### Goals

* Ensure that NKG continues to work and/or fails predictably in the face of invalid input.
* Ensure that both cluster admins and application developers can see the validation errors reported about the
  resources they create (own).
* For the best UX, minimize the feedback loop: users should be able to see most of the validation errors reported by a
  Kubernetes API server during a CRUD operation on a resource.
* Ensure that the validation mechanism conforms to the Gateway API spec.

### Non-Goals

* Validation of non-Gateway API resources: Secrets, EndpointSlices. (For example, a TLS Secret resource might include
  an invalid TLS cert that will make NGINX fail to reload.)

## Design

We will introduce two validation methods to be run by the NKG control plane:

1. Re-run of the Gateway API webhook validation
2. NKG-specific field validation

### Re-run of Webhook Validation

Before processing a resource, NKG will validate it using the functions from
the [validation package](https://github.com/kubernetes-sigs/gateway-api/tree/b241afc88e68c952cc0a59a5c72a51358dc2bada/apis/v1beta1/validation)
from the Gateway API. This will ensure that the webhook validation cannot be bypassed (it can be bypassed if the webhook
is not installed, misconfigured, or running a different version), and it will allow us to avoid repeating the same
validation in our code.

If a resource is invalid:

* NKG will not process it -- it will treat it as if the resource didn't exist. This also means that if the resource was
  updated from a valid to an invalid state, NKG will also ignore any previous valid state. For example, it will remove
  the generated configuration for an HTTPRoute resource.
* NKG will report the validation error as a
  Warning [Event](https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/event-v1/)
  for that resource. The Event message will describe the error and explain that the resource was ignored. We chose to
  report an Event instead of updating the status, because to update the status, NKG first needs to look inside the
  resource to determine whether it belongs to NKG or not. However, since the webhook validation applies to all parts of
  the spec of a resource, NKG would have to look inside the invalid resource and parse potentially invalid parts. To
  avoid that, NKG will report an Event. The owner of the resource will be able to see the Event.
* NKG will also report the validation error in the NKG logs.
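
The handling of invalid resources described above can be sketched in Go as follows. This is a simplified illustration under stated assumptions, not NKG's implementation: `route`, `eventRecorder`, `memRecorder`, `validateRoute`, and `filterValid` are hypothetical names, the `validate` argument stands in for a function from the Gateway API validation package, and the sample rule is made up.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// route is a simplified, hypothetical stand-in for an HTTPRoute.
type route struct {
	Namespace string
	Name      string
	Path      string
}

// eventRecorder is a hypothetical abstraction over emitting Kubernetes Events.
type eventRecorder interface {
	Warning(namespace, name, reason, message string)
}

// memRecorder keeps events in memory; it exists only to exercise the sketch.
type memRecorder struct {
	events []string
}

func (m *memRecorder) Warning(namespace, name, reason, message string) {
	m.events = append(m.events, fmt.Sprintf("%s/%s %s: %s", namespace, name, reason, message))
}

// validateRoute stands in for the re-run webhook validation.
// The rule below is made up for illustration.
func validateRoute(r route) error {
	if r.Path == "" {
		return errors.New("path must not be empty")
	}
	return nil
}

// filterValid returns only the routes that pass validation. For each invalid
// route it emits a Warning event and a log entry, and then drops the route,
// so that later processing treats it as if it did not exist.
func filterValid(routes []route, validate func(route) error, rec eventRecorder) []route {
	valid := make([]route, 0, len(routes))
	for _, r := range routes {
		if err := validate(r); err != nil {
			msg := fmt.Sprintf("the resource is ignored because it failed validation: %v", err)
			rec.Warning(r.Namespace, r.Name, "Rejected", msg)
			log.Printf("ignoring invalid resource %s/%s: %v", r.Namespace, r.Name, err)
			continue
		}
		valid = append(valid, r)
	}
	return valid
}
```

Because the invalid route never reaches the configuration generator, any configuration previously generated for it disappears on the next regeneration.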

### NKG-specific Validation

After re-running the webhook validation, NKG will run NKG-specific validation, written in Go.

NKG-specific validation will:

1. Ensure field values are considered valid by NGINX (cannot make NGINX fail to reload).
2. Ensure valid field values do not include any malicious configuration.
3. Report an error if an unsupported field is present in a resource (as the Gateway API spec prescribes).

NKG-specific validation will not include:

- *All* validation done by CRDs. NKG will only repeat the validation that addresses (1) and (2) in the list above with
  extra rules required by NGINX but missing in the CRDs. For example, NKG will not enforce the limits on field values.
- The validation done by the webhook (because it is done in the previous step).

If a resource is invalid, NKG will report the error in its status.
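
A minimal Go sketch of rule (3) and of reporting the result in the status is shown below. The `condition` type is a simplified stand-in for a Kubernetes status condition, and the contents of `supportedFilters` are an assumption for this example, not a statement of which filters NKG supports. The `Accepted` / `False` / `UnsupportedValue` values follow the Gateway API wording quoted in the Background section.

```go
package main

import "fmt"

// condition is a simplified, hypothetical stand-in for a status condition.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// supportedFilters is an assumption for this sketch only.
var supportedFilters = map[string]bool{
	"RequestRedirect": true,
}

// acceptedCondition returns the Accepted condition for a route that
// configures the given filter types. The first unsupported filter makes the
// route not accepted, with the reason UnsupportedValue.
func acceptedCondition(filterTypes []string) condition {
	for _, ft := range filterTypes {
		if !supportedFilters[ft] {
			return condition{
				Type:    "Accepted",
				Status:  "False",
				Reason:  "UnsupportedValue",
				Message: fmt.Sprintf("unsupported filter type %q", ft),
			}
		}
	}
	return condition{
		Type:    "Accepted",
		Status:  "True",
		Reason:  "Accepted",
		Message: "the route is accepted",
	}
}
```

Unlike the Event used for the re-run of webhook validation, this result can be written to the status because the resource has already passed the earlier validation step.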

### Summary of Validation

The table below summarizes the validation methods NKG will use. Any Gateway API resource will be validated by the
following methods in order of their appearance in the table.

| Name                         | Type    | Component             | Scope                   | Feedback loop for errors                                                          | Can be bypassed?                                                                     |
|------------------------------|---------|-----------------------|-------------------------|-----------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
| CRD validation               | OpenAPI | Kubernetes API server | Structure, field values | Kubernetes API server returns any errors in the response to an API call.          | Yes, if the CRDs are modified.                                                       |
| Webhook validation           | Go code | Gateway API webhook   | Field values            | Kubernetes API server returns any errors in the response to an API call.          | Yes, if the webhook is not installed, misconfigured, or running a different version. |
| Re-run of webhook validation | Go code | NKG control plane     | Field values            | Errors are reported as an Event for the resource.                                 | No                                                                                   |
| NKG-specific validation      | Go code | NKG control plane     | Field values            | Errors are reported in the status of a resource after its creation/modification.  | No                                                                                   |

Notes:

* The amount and the extent of the validation should allow multiple application developers to share a single NKG (User
  story 1).
* We expect that most of the validation problems will be caught by CRD and webhook validation and reported quickly to
  users as a response to a Kubernetes API call (User story 2).

### Evolution

NKG will support more resources:

- More Gateway API resources. For those, NKG will use the four validation methods from the table in the previous
  section.
- New NKG resources. For those, NKG will use CRD validation (the rules of which are fully controlled by us). The
  CRD validation will include the validation to prevent invalid NGINX configuration values and malicious values. Because
  the CRD validation can be bypassed, the NKG control plane will need to run the same validation rules. In addition to
  that, the NKG control plane will run any extra validation not possible to define via CRDs.

We will not introduce any NKG webhook in the cluster (it adds operational complexity for the cluster admin and is a
source of potential downtime -- a webhook failure disables CRUD operations on the relevant resources) unless we find
good reasons for that.

### Upgrades

Since NKG will use the validation package from the Gateway API project, when a new Gateway API release happens, we will
need to upgrade the dependency and release a new version of NKG, provided that the validation code changed. However, if
it did not change, we do not need to release a new version. Note: other changes in a new Gateway API release, such as a
new field to support, might still prompt us to release a new version. See also
[GEP-922](https://gateway-api.sigs.k8s.io/geps/gep-922/#).

### Reliability

NKG processes two kinds of transactions:

* *Data plane transactions*. NGINX handles requests from clients that want to connect to applications exposed through
  NKG.
* *Control plane transactions*. NKG handles configuration requests (for example, a new HTTPRoute is created) from NKG
  users.

Invalid user input makes NGINX config invalid, which means NGINX will fail to reload, which will prevent any new control
plane transactions until that invalid value is fixed or removed. The proposed design addresses this issue by preventing
NKG from generating invalid NGINX configuration.

However, in case of bugs in the NKG validation code, NKG might still generate an invalid NGINX config. When that
happens, NGINX will fail to reload, but it will continue to use the last known valid config, so that the data plane
transactions will not be stopped. This situation must be reported to both the cluster admin and the app developers.
However, this is out of the scope of this design doc.

### Security

The proposed design ensures that the configuration values are properly validated before reaching NGINX config, which
will prevent a malicious user from misusing them. For example, it will not be possible to inject NGINX configuration
which can turn it into a web server serving the contents of the NKG data plane container file system.

## Alternatives Considered

### Utilize CRD Validation

It is [possible](https://github.com/hasheddan/k8s-cr-validator) to run CRD validation from Go code. However, this will
require NKG to be shipped with the Gateway API CRDs, which will increase the coupling between NKG and the Gateway API
version.

Additionally, the extra benefits are not clear: the validation proposed in this design document should adequately
address reliability and security issues. Also, disabling CRD validation in the API server is not easy for an application
developer -- they need to be a cluster admin to update the CRDs in the cluster.

At the same time, if a [convenient validation package](https://github.com/kubernetes-sigs/gateway-api/issues/926)
that includes CRD validation is developed, we will revisit the design.

### Write NKG-specific Validation Rules in Validation Language

It is possible to define validation rules in an expression language like [CEL](https://github.com/google/cel-spec). NKG
can load those rules, compile and run them.

Because (1) we need to define validation rules only for parts of Gateway API resources and (2) it is not necessary to
load them on the fly, the approach will not provide any benefits over defining those rules in Go.

At the same time, we might use CEL for validating future NKG CRDs (CEL
is [supported](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#validation-rules)
in CRDs).
