| 1 | +# Resource Validation |
| 2 | + |
| 3 | +NGINX Kubernetes Gateway (NKG) must validate Gateway API resources for reliability, security, and conformity with the |
| 4 | +Gateway API specification. |
| 5 | + |
| 6 | +## Background |
| 7 | + |
| 8 | +### Why Validate? |
| 9 | + |
| 10 | +NKG transforms the Gateway API resources into NGINX configuration. Before transforming a resource, NKG needs to ensure |
| 11 | +its validity, which is important for the following reasons: |
| 12 | + |
| 13 | +1. *To prevent an invalid value from propagating into NGINX configuration*. For example, the URI in a path-based routing |
| 14 | + rule could carry such a value. The propagation has the following consequences: |
| 15 | + 1. Invalid input will make NGINX fail to reload. Moreover, until the corresponding invalid config is removed from |
| 16 | + NGINX configuration, NKG will not be able to reload NGINX for any future configuration changes. This affects the |
| 17 | + reliability of NKG. |
| 18 | + 2. Malicious input can breach the security of NKG. For example, if a malicious user can insert raw NGINX config |
| 19 | + (something similar to an SQL injection), they can configure NGINX to serve the files on the container filesystem. |
| 20 | + This affects the security of NKG. |
| 21 | +2. *To conform to the Gateway API spec*. For example, if an HTTPRoute configures an unsupported filter, an |
| 22 | + implementation like NKG needs to "set Accepted Condition for the Route to `status: False`, with a Reason |
| 23 | + of `UnsupportedValue`". |
| 24 | + |
| 25 | +### Validation by the Gateway API Project |
| 26 | + |
| 27 | +To help the implementations with the validation, the Gateway API already includes: |
| 28 | + |
| 29 | +* *The OpenAPI schema with validation rules in the Gateway API CRDs*. It enforces the structure (for example, the field |
| 30 | + X must be a string) and the contents of the fields (for example, field Y only allows values 'a' and 'b'). |
| 31 | + Additionally, it enforces limits such as max lengths on field values. Note: |
| 32 | + the Kubernetes API server enforces this validation. To bypass it, a user needs to change the CRDs. |
| 33 | +* *The webhook validation*. This validation is written in Go and runs as part of the webhook, which is included in the |
| 34 | + Gateway API installation files. The validation covers additional logic that is not possible to implement in the CRDs. It does |
| 35 | + not repeat the validation from the CRDs. Note: a user can bypass this validation if the webhook is not installed. |
| 36 | + |
| 37 | +However, the built-in validation rules do not cover all validation needs of NKG: |
| 38 | + |
| 39 | +* The rules are not enough for NGINX. For example, the validation rule for the |
| 40 | + `value` of the path in a path-based routing rule allows symbols like `;`, `{` |
| 41 | + and `}`, which can break NGINX configuration for the |
| 42 | + corresponding [location](https://nginx.org/en/docs/http/ngx_http_core_module.html#location) block. |
| 43 | +* The rules don't cover unsupported field cases. For example, the webhook does not know which filters are implemented by |
| 44 | + NKG, thus it cannot generate an appropriate error for NKG. |
| 45 | + |
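As a minimal sketch of the first gap, the Go snippet below rejects path values that contain the characters named above. The function name `validatePathValue` and the exact set of forbidden characters are assumptions made for illustration; a real rule set for NGINX would likely need to cover more cases.

```go
package main

import (
	"fmt"
	"strings"
)

// validatePathValue is a hypothetical helper, not part of NKG or the Gateway API.
// It rejects path values that could break an NGINX location block.
func validatePathValue(path string) error {
	if !strings.HasPrefix(path, "/") {
		return fmt.Errorf("path %q must start with /", path)
	}
	// Assumed, non-exhaustive set: characters that are syntactically
	// significant in NGINX configuration.
	const forbidden = ";{}"
	if i := strings.IndexAny(path, forbidden); i >= 0 {
		return fmt.Errorf("path %q contains forbidden character %q", path, path[i])
	}
	return nil
}

func main() {
	for _, p := range []string{"/coffee", "/tea;} location /secret {"} {
		if err := validatePathValue(p); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("valid:", p)
	}
}
```

The second input shows the kind of value that passes the built-in rules but would let a user close the generated block and open a new one.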
| 46 | +Additionally, as mentioned in [GEP-922](https://gateway-api.sigs.k8s.io/geps/gep-922/#implementers), |
| 47 | +"implementers must not rely on webhook or CRD validation as a security mechanism. If field values need to be escaped to |
| 48 | +secure an implementation, both webhook and CRD validation can be bypassed and cannot be relied on." |
| 49 | + |
| 50 | +## Requirements |
| 51 | + |
| 52 | +Design a validation mechanism for Gateway API resources. |
| 53 | + |
| 54 | +### Personas |
| 55 | + |
| 56 | +* *Cluster admin* who installs Gateway API (the CRDs and Webhook), installs NKG, creates Gateway and GatewayClass |
| 57 | + resources. |
| 58 | +* *Application developer* who creates HTTPRoutes and other routes. |
| 59 | + |
| 60 | +### User Stories |
| 61 | + |
| 62 | +1. As a cluster admin, I'd like to share NKG among multiple application developers, specifically in a way that invalid |
| 63 | + resources of one developer do not affect the resources of the other developers. |
| 64 | +2. As a cluster admin/application developer, I expect that NKG rejects any invalid resources I create and I am able to |
| 65 | + see the reasons (errors) for that. |
| 66 | + |
| 67 | +### Goals |
| 68 | + |
| 69 | +* Ensure that NKG continues to work and/or fails predictably in the face of invalid input. |
| 70 | +* Ensure that both cluster admin and application developers can see the validation errors reported about the resources |
| 71 | + they create (own). |
| 72 | +* For the best UX, minimize the feedback loop: users should be able to see most of the validation errors reported by a |
| 73 | + Kubernetes API server during a CRUD operation on a resource. |
| 74 | +* Ensure that the validation mechanism conforms to the Gateway API spec. |
| 75 | + |
| 76 | +### Non-Goals |
| 77 | + |
| 78 | +* Validation of non-Gateway API resources: Secrets, EndpointSlices. (For example, a TLS Secret resource might include a |
| 79 | + invalid TLS cert that will make NGINX fail to reload.) |
| 80 | + |
| 81 | +## Design |
| 82 | + |
| 83 | +We will introduce two validation methods to be run by the NKG control plane: |
| 84 | + |
| 85 | +1. Re-run of the Gateway API webhook validation |
| 86 | +2. NKG-specific field validation |
| 87 | + |
| 88 | +### Re-run of Webhook Validation |
| 89 | + |
| 90 | +Before processing a resource, NKG will validate it using the functions from |
| 91 | +the [validation package](https://github.com/kubernetes-sigs/gateway-api/tree/b241afc88e68c952cc0a59a5c72a51358dc2bada/apis/v1beta1/validation) |
| 92 | +from the Gateway API. This will ensure that the webhook validation cannot be bypassed (it can be bypassed if the webhook |
| 93 | +is not installed, misconfigured, or running a different version), and it will allow us to avoid repeating the same |
| 94 | +validation in our code. |
| 95 | + |
| 96 | +If a resource is invalid: |
| 97 | + |
| 98 | +* NKG will not process it -- it will treat it as if the resource didn't exist. This also means that if the resource was |
| 99 | + updated from a valid to an invalid state, NKG will also ignore any previous valid state. For example, it will remove |
| 100 | + the generated configuration for an HTTPRoute resource. |
| 101 | +* NKG will report the validation error as a |
| 102 | + Warning [Event](https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/event-v1/) |
| 103 | + for that resource. The Event message will describe the error and explain that the resource was ignored. We chose to |
| 104 | + report an Event instead of updating the status, because to update the status, NKG first needs to look inside the |
| 105 | + resource to determine whether it belongs to it or not. However, since the webhook validation applies to all parts of |
| 106 | + the spec of a resource, it means NKG has to look inside the invalid resource and parse potentially invalid parts. To |
| 107 | + avoid that, NKG will report an Event. The owner of the resource will be able to see the Event. |
| 108 | +* NKG will also report the validation error in the NKG logs. |
| 109 | + |
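The behavior described in this section can be sketched in Go as follows. All types here (`Resource`, `Event`, `validateFunc`) and the toy rule `requireHostname` are hypothetical stand-ins; in NKG the validator would come from the Gateway API validation package and the Event would be a Kubernetes Event.

```go
package main

import (
	"errors"
	"fmt"
)

// Resource is a hypothetical stand-in for a Gateway API resource such as an HTTPRoute.
type Resource struct {
	Name     string
	Hostname string
}

// Event is a hypothetical stand-in for a Kubernetes Event.
type Event struct {
	Type     string
	Resource string
	Message  string
}

// validateFunc stands in for a function from the Gateway API validation package.
type validateFunc func(r Resource) []error

// requireHostname is a toy rule used only for this illustration.
func requireHostname(r Resource) []error {
	if r.Hostname == "" {
		return []error{errors.New("hostname must not be empty")}
	}
	return nil
}

// filterValid returns the resources that pass validation, plus one Warning
// Event for every resource that fails and is therefore ignored.
func filterValid(resources []Resource, validate validateFunc) ([]Resource, []Event) {
	var valid []Resource
	var events []Event
	for _, r := range resources {
		if errs := validate(r); len(errs) > 0 {
			events = append(events, Event{
				Type:     "Warning",
				Resource: r.Name,
				Message:  fmt.Sprintf("resource ignored, validation failed: %v", errs),
			})
			continue
		}
		valid = append(valid, r)
	}
	return valid, events
}

func main() {
	routes := []Resource{{Name: "coffee", Hostname: "cafe.example.com"}, {Name: "tea"}}
	valid, events := filterValid(routes, requireHostname)
	fmt.Println("processed:", len(valid), "ignored:", len(events))
}
```

Because invalid resources are dropped before any further processing, later stages never need to parse their contents.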
| 110 | +### NKG-Specific Validation |
| 111 | + |
| 112 | +After re-running the webhook validation, NKG will run NKG-specific validation, written in Go. |
| 113 | + |
| 114 | +NKG-specific validation will: |
| 115 | + |
| 116 | +1. Ensure field values are considered valid by NGINX (cannot make NGINX fail to reload). |
| 117 | +2. Ensure valid field values do not include any malicious configuration. |
| 118 | +3. Report an error if an unsupported field is present in a resource (as the Gateway API spec prescribes). |
| 119 | + |
| 120 | +NKG-specific validation will not include: |
| 121 | + |
| 122 | +- *All* validation done by CRDs. NKG will only repeat the validation that addresses (1) and (2) in the list above with |
| 123 | + extra rules required by NGINX but missing in the CRDs. For example, NKG will not ensure the limits of field values. |
| 124 | +- The validation done by the webhook (because it is done in the previous step). |
| 125 | + |
| 126 | +If a resource is invalid, NKG will report the error in its status. |
| 127 | + |
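As an illustration of point (3) and of reporting through the status, the sketch below derives an `Accepted` condition from the filter types of a route. The `Condition` struct is a simplified stand-in for the Kubernetes condition type, and the contents of `supportedFilters` are an assumption, not a statement of what NKG implements.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a status condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// supportedFilters is an assumed set used only for this example.
var supportedFilters = map[string]bool{
	"RequestRedirect": true,
}

// acceptedCondition returns Accepted=False with reason UnsupportedValue
// as soon as it finds a filter type that is not in supportedFilters.
func acceptedCondition(filterTypes []string) Condition {
	for _, ft := range filterTypes {
		if !supportedFilters[ft] {
			return Condition{
				Type:    "Accepted",
				Status:  "False",
				Reason:  "UnsupportedValue",
				Message: fmt.Sprintf("unsupported filter type %q", ft),
			}
		}
	}
	return Condition{
		Type:    "Accepted",
		Status:  "True",
		Reason:  "Accepted",
		Message: "the route is accepted",
	}
}

func main() {
	fmt.Printf("%+v\n", acceptedCondition([]string{"RequestRedirect"}))
	fmt.Printf("%+v\n", acceptedCondition([]string{"RequestRedirect", "URLRewrite"}))
}
```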
| 128 | +### Summary of Validation |
| 129 | + |
| 130 | +The table below summarizes the validation methods NKG will use. Any Gateway API resource will be validated by the |
| 131 | +following methods in order of their appearance in the table. |
| 132 | + |
| 133 | +| Name | Type | Component | Scope | Feedback loop for errors | Can be bypassed? | |
| 134 | +|------------------------------|---------|-----------------------|-------------------------|----------------------------------------------------------------------------------|--------------------------------------------------------------------------------------| |
| 135 | +| CRD validation | OpenAPI | Kubernetes API server | Structure, field values | Kubernetes API server returns any errors in the response to an API call. | Yes, if the CRDs are modified. | |
| 136 | +| Webhook validation | Go code | Gateway API webhook | Field values | Kubernetes API server returns any errors in the response to an API call. | Yes, if the webhook is not installed, misconfigured, or running a different version. | |
| 137 | +| Re-run of webhook validation | Go code | NKG control plane | Field values | Errors are reported as an Event for the resource. | No | |
| 138 | +| NKG-specific validation | Go code | NKG control plane | Field values | Errors are reported in the status of a resource after its creation/modification. | No | |
| 139 | + |
| 140 | +Notes: |
| 141 | + |
| 142 | +* The amount and the extent of the validation should allow multiple application developers to share a single NKG (User |
| 143 | + story 1). |
| 144 | +* We expect that most of the validation problems will be caught by CRD and webhook validation and reported quickly to |
| 145 | + users as a response to a Kubernetes API call (User story 2). |
| 146 | + |
| 147 | +### Evolution |
| 148 | + |
| 149 | +NKG will support more resources: |
| 150 | + |
| 151 | +- More Gateway API resources. For those, NKG will use the four validation methods from the table in the previous |
| 152 | + section. |
| 153 | +- New NKG resources. For those, NKG will use CRD validation (the rules of which are fully controlled by us). The |
| 154 | + CRD validation will include the validation to prevent invalid NGINX configuration values and malicious values. Because |
| 155 | + the CRD validation can be bypassed, the NKG control plane will need to run the same validation rules. In addition to that, |
| 156 | + the NKG control plane will run any extra validation not possible to define via CRDs. |
| 157 | + |
| 158 | +We will not introduce any NKG webhook in the cluster (it adds operational complexity for the cluster admin and is a |
| 159 | +source of potential downtime -- a webhook failure disables CRUD operations on the relevant resources) unless we find |
| 160 | +good reasons for that. |
| 161 | + |
| 162 | +### Upgrades |
| 163 | + |
| 164 | +Since NKG will use the validation package from the Gateway API project, when a new release happens, we will need to |
| 165 | +upgrade the dependency and release a new version of NKG, provided that the validation code changed. However, if it did |
| 166 | +not change, we do not need to release a new version. Note: other changes in a new Gateway API release, such as support for |
| 167 | +a new field, might prompt us to release a new version. See also |
| 168 | +[GEP-922](https://gateway-api.sigs.k8s.io/geps/gep-922/#). |
| 169 | + |
| 170 | +### Reliability |
| 171 | + |
| 172 | +NKG processes two kinds of transactions: |
| 173 | + |
| 174 | +* *Data plane transactions*. NGINX handles requests from clients that want to connect to applications exposed through |
| 175 | + NKG. |
| 176 | +* *Control plane transactions*. NKG handles configuration requests (for example, a new HTTPRoute is created) from NKG users. |
| 177 | + |
| 178 | +Invalid user input makes NGINX config invalid, which means NGINX will fail to reload, which will prevent any new control |
| 179 | +plane transactions until that invalid value is fixed or removed. The proposed design addresses this issue by preventing |
| 180 | +NKG from generating invalid NGINX configuration. |
| 181 | + |
| 182 | +However, in case of bugs in the NKG validation code, NKG might still generate an invalid NGINX config. When that |
| 183 | +happens, NGINX will fail to reload, but it will continue to use the last known valid config, so that the data plane |
| 184 | +transactions will not be stopped. This situation must be reported to both the cluster admin and the app developers, |
| 185 | +but the reporting mechanism is out of the scope of this design doc. |
| 186 | + |
| 187 | +### Security |
| 188 | + |
| 189 | +The proposed design ensures that the configuration values are properly validated before reaching NGINX config, which |
| 190 | +will prevent a malicious user from misusing them. For example, it will not be possible to inject NGINX configuration |
| 191 | +that turns NGINX into a web server serving the contents of the NKG data plane container file system. |
| 192 | + |
| 193 | +## Alternatives Considered |
| 194 | + |
| 195 | +### Utilize CRD Validation |
| 196 | + |
| 197 | +It is [possible](https://github.com/hasheddan/k8s-cr-validator) to run CRD validation from Go code. However, this will |
| 198 | +require NKG to be shipped with the Gateway API CRDs, which will increase the coupling between NKG and the Gateway API |
| 199 | +version. |
| 200 | + |
| 201 | +Additionally, the extra benefits are not clear: the validation proposed in this design document should adequately |
| 202 | +address reliability and security issues. Also, disabling CRD validation in the API server is not easy for an application |
| 203 | +developer -- they need to be a cluster admin to update the CRDs in the cluster. |
| 204 | + |
| 205 | +At the same time, if a [convenient validation package](https://github.com/kubernetes-sigs/gateway-api/issues/926) |
| 206 | +that includes CRD validation is developed, we will revisit the design. |
| 207 | + |
| 208 | +### Write NKG-specific Validation Rules in Validation Language |
| 209 | + |
| 210 | +It is possible to define validation rules in an expression language like [CEL](https://github.com/google/cel-spec). NKG |
| 211 | +can load those rules, compile and run them. |
| 212 | + |
| 213 | +Because (1) we need to define validation rules only for parts of Gateway API resources and (2) it is not necessary to |
| 214 | +load them on the fly, the approach will not provide any benefits over defining those rules in Go. |
| 215 | + |
| 216 | +At the same time, we might use CEL for validating future NKG CRDs (CEL |
| 217 | +is [supported](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#validation-rules) |
| 218 | +in CRDs). |