Commit f3b103b

modify troubleshooting guide
1 parent 3fafa2e commit f3b103b

1 file changed

+315
-17
lines changed

site/content/how-to/monitoring/troubleshooting.md

@@ -65,54 +65,350 @@ LAST SEEN TYPE REASON OBJECT
6565
5s Warning ResourceDeleted nginxgateway/ngf-config NginxGateway configuration was deleted; using defaults
6666
```
6767

68+
##### Get shell access to containers
69+
70+
Getting shell access to containers allows developers and operators to view the environment of a running container, see its logs or diagnose any problems. To get shell access to the NGINX container, use `kubectl exec`:
71+
72+
```shell
73+
kubectl exec -it [-n namespace] <ngf-pod-name> -c nginx -- /bin/sh
74+
```
75+
6876
##### Logs
6977

7078
Logs from the NGINX Gateway Fabric control plane and data plane can contain information that isn't available to status or events. These can include errors in processing or passing traffic.
7179

72-
To see logs for the control plane container:
80+
1. To see logs for the control plane container:
7381

7482
```shell
75-
kubectl -n nginx-gateway logs <ngf-pod-name> -c nginx-gateway
83+
kubectl [-n namespace] logs <ngf-pod-name> -c nginx-gateway
7684
```
7785

7886
To see logs for the data plane container:
7987

8088
```shell
81-
kubectl -n nginx-gateway logs <ngf-pod-name> -c nginx
89+
kubectl [-n namespace] logs <ngf-pod-name> -c nginx
90+
```
91+
92+
1. To filter for error logs in the control plane and data plane containers:
93+
94+
For the _nginx-gateway_ container, you can `grep` for the word `error` or change the log level to `error` by following the steps in [Modify logging levels](#modify-logging-levels).
95+
96+
```shell
97+
kubectl [-n namespace] logs <ngf-pod-name> -c nginx-gateway | grep error
98+
```
99+
100+
For example, this error message appears when usage reporting is not enabled for an NGINX Plus installation:
101+
102+
```text
103+
kubectl logs -n nginx-gateway nginx-gateway-nginx-gateway-fabric-77f8746996-j6z6v | grep error
104+
Defaulted container "nginx-gateway" out of: nginx-gateway, nginx
105+
{"level":"error","ts":"2024-06-13T18:22:16Z","logger":"usageReporter","msg":"Usage reporting must be enabled when using NGINX Plus; redeploy with usage reporting enabled","error":"usage reporting not enabled","stacktrace":"github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.createUsageWarningJob.func1\n\tgithub.com/nginxinc/nginx-gateway-fabric/internal/mode/static/manager.go:616\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\tk8s.io/[email protected]/pkg/util/wait/backoff.go:259\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\tk8s.io/[email protected]/pkg/util/wait/backoff.go:226\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\tk8s.io/[email protected]/pkg/util/wait/backoff.go:227\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tk8s.io/[email protected]/pkg/util/wait/backoff.go:204\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\tk8s.io/[email protected]/pkg/util/wait/backoff.go:259\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/runnables.(*CronJob).Start\n\tgithub.com/nginxinc/nginx-gateway-fabric/internal/framework/runnables/cronjob.go:53\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\tsigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:226"}
106+
```
107+
108+
For the _nginx_ container:
109+
110+
```shell
111+
kubectl [-n namespace] logs <ngf-pod-name> -c nginx | grep emerg
82112
```
83113

114+
For example, if a variable is too long, NGINX may display an error message such as:
115+
116+
```text
117+
kubectl logs -n nginx-gateway ngf-nginx-gateway-fabric-bb8598998-jwk2m -c nginx | grep emerg
118+
2024/06/13 20:04:17 [emerg] 27#27: too long parameter, probably missing terminating """ character in /etc/nginx/conf.d/http.conf:78
119+
```
120+
121+
1. NGINX access logs record all requests processed by the NGINX server. These logs provide detailed information about each request, which can be useful for troubleshooting and analyzing web traffic.
122+
To view the access logs, get shell access to your NGINX container by following these [steps](#get-shell-access-to-containers). The access logs are located in the file `/var/log/nginx/access.log` in the NGINX container.
123+
84124
You can see logs for a crashed or killed container by adding the `-p` flag to the above commands.
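
A plain `grep error` also matches lines that merely mention the word "error" at another level. As a minimal sketch, the two helper functions below (hypothetical names, not part of NGINX Gateway Fabric) filter on the log level instead. They assume the compact JSON layout of the control plane logs and the bracketed severity of the NGINX error log, as in the examples above.

```shell
# Keep only control plane log lines whose JSON "level" field is "error".
# Assumes the compact JSON layout shown in the example output above.
ngf_errors() {
  grep -F '"level":"error"'
}

# Keep NGINX error-log lines at severity "error" or higher
# (NGINX severities, lowest to highest: debug, info, notice, warn,
# error, crit, alert, emerg).
nginx_severe() {
  grep -E '\[(error|crit|alert|emerg)\]'
}
```

Both functions read from standard input, so they can be placed after any of the `kubectl logs` commands above, for example `kubectl [-n namespace] logs <ngf-pod-name> -c nginx | nginx_severe`.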
85125

126+
##### If NGINX Gateway Fabric Pod is not running or ready
127+
128+
To understand why the NGINX Gateway Fabric Pod has not started running or is not ready, first check the state of the Pod to get detailed information about its current status and recent events. To do this, use `kubectl describe`:
129+
130+
```shell
131+
kubectl describe pod <ngf-pod-name> [-n namespace]
132+
```
133+
134+
The pod description includes details about the image name, tags, current status, and environment variables. Please verify that these details match your setup and cross-check with the events to ensure everything is functioning as expected. For example, the pod below has two containers that are running and the events reflect the same.
135+
136+
```text
137+
Containers:
138+
nginx-gateway:
139+
Container ID: containerd://06c97a9de938b35049b7c63e251418395aef65dd1ff996119362212708b79cab
140+
Image: nginx-gateway-fabric:sa.choudhary
141+
Image ID: docker.io/library/import-2024-06-13@sha256:1460d63bd8a352a6e455884d7ebf51ce9c92c512cb43b13e44a1c3e3e6a08918
142+
Ports: 9113/TCP, 8081/TCP
143+
Host Ports: 0/TCP, 0/TCP
144+
State: Running
145+
Started: Thu, 13 Jun 2024 11:47:46 -0600
146+
Ready: True
147+
Restart Count: 0
148+
Readiness: http-get http://:health/readyz delay=3s timeout=1s period=1s #success=1 #failure=3
149+
Environment:
150+
POD_IP: (v1:status.podIP)
151+
POD_NAMESPACE: nginx-gateway (v1:metadata.namespace)
152+
POD_NAME: ngf-nginx-gateway-fabric-66dd665756-zh7d7 (v1:metadata.name)
153+
nginx:
154+
Container ID: containerd://c2f3684fd8922e4fac7d5707ab4eb5f49b1f76a48893852c9a812cd6dbaa2f55
155+
Image: nginx-gateway-fabric/nginx:sa.choudhary
156+
Image ID: docker.io/library/import-2024-06-13@sha256:c9a02cb5665c6218373f8f65fc2c730f018d0ca652ae827cc913a7c6e9db6f45
157+
Ports: 80/TCP, 443/TCP
158+
Host Ports: 0/TCP, 0/TCP
159+
State: Running
160+
Started: Thu, 13 Jun 2024 11:47:46 -0600
161+
Ready: True
162+
Restart Count: 0
163+
Environment: <none>
164+
Events:
165+
Type Reason Age From Message
166+
---- ------ ---- ---- -------
167+
Normal Scheduled 40s default-scheduler Successfully assigned nginx-gateway/ngf-nginx-gateway-fabric-66dd665756-zh7d7 to kind-control-plane
168+
Normal Pulled 40s kubelet Container image "nginx-gateway-fabric:sa.choudhary" already present on machine
169+
Normal Created 40s kubelet Created container nginx-gateway
170+
Normal Started 39s kubelet Started container nginx-gateway
171+
Normal Pulled 39s kubelet Container image "nginx-gateway-fabric/nginx:sa.choudhary" already present on machine
172+
Normal Created 39s kubelet Created container nginx
173+
Normal Started 39s kubelet Started container nginx
174+
```
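
When only the readiness of each container matters, the full description can be reduced to a few fields. The sketch below wraps `kubectl get pod` with a JSONPath template; the function name `ngf_container_status` is an illustrative assumption, and the namespace and Pod name are passed in by the caller.

```shell
# Print name, ready flag, and restart count for each container in a Pod.
# Usage: ngf_container_status <namespace> <ngf-pod-name>
ngf_container_status() {
  kubectl -n "$1" get pod "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\t"}{.restartCount}{"\n"}{end}'
}
```

For the Pod described above, this would print one line each for the `nginx-gateway` and `nginx` containers.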
175+
176+
177+
### Modify logging levels
178+
179+
To debug NGINX Gateway Fabric, enable verbose logging by editing the `NginxGateway` configuration. This can be done either before or after deploying NGINX Gateway Fabric.
180+
181+
#### Modify log levels before deploying
182+
183+
1. If using manifests, edit `deploy/manifests/nginx-gateway.yaml` to update the logging level for `nginx-gateway-config`:
184+
185+
```yaml
186+
apiVersion: gateway.nginx.org/v1alpha1
187+
kind: NginxGateway
188+
metadata:
189+
name: nginx-gateway-config
190+
namespace: nginx-gateway
191+
labels:
192+
app.kubernetes.io/name: nginx-gateway
193+
app.kubernetes.io/instance: nginx-gateway
194+
app.kubernetes.io/version: "edge"
195+
spec:
196+
logging:
197+
level: debug
198+
```
199+
200+
1. If using Helm, add `--set nginxGateway.config.logging.level=<log-level>` to your Helm installation command.
201+
202+
#### Modify log levels after deploying
203+
204+
Once you have deployed NGINX Gateway Fabric, you can modify log levels by editing the config for NGINX Gateway as shown below:
205+
206+
```shell
207+
kubectl [-n namespace] edit nginxgateways ngf-config
208+
```
209+
210+
```yaml
211+
apiVersion: gateway.nginx.org/v1alpha1
212+
kind: NginxGateway
213+
metadata:
214+
annotations:
215+
meta.helm.sh/release-name: ngf
216+
meta.helm.sh/release-namespace: nginx-gateway
217+
creationTimestamp: "2024-06-12T18:35:05Z"
218+
generation: 1
219+
labels:
220+
app.kubernetes.io/instance: ngf
221+
app.kubernetes.io/managed-by: Helm
222+
app.kubernetes.io/name: nginx-gateway-fabric
223+
app.kubernetes.io/version: edge
224+
helm.sh/chart: nginx-gateway-fabric-1.3.0
225+
name: ngf-config
226+
namespace: nginx-gateway
227+
resourceVersion: "62293"
228+
uid: fa6d6a12-14e1-4168-95d5-595e7f63b270
229+
spec:
230+
logging:
231+
level: debug
232+
```
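
As a non-interactive alternative to `kubectl edit`, the same field can be changed with a merge patch. This is a sketch under assumptions: the function name `set_ngf_log_level` is hypothetical, and the accepted levels (`info`, `debug`, `error`) are assumed rather than taken from this guide, so check them against the `NginxGateway` API reference.

```shell
# Set spec.logging.level on an NginxGateway resource with a merge patch.
# Usage: set_ngf_log_level <namespace> <resource-name> <level>
set_ngf_log_level() {
  case "$3" in
    info|debug|error) ;;  # assumed set of valid levels
    *) echo "unsupported level: $3" >&2; return 1 ;;
  esac
  kubectl -n "$1" patch nginxgateways "$2" --type=merge \
    -p "{\"spec\":{\"logging\":{\"level\":\"$3\"}}}"
}
```

For the resource shown above, the call would be `set_ngf_log_level nginx-gateway ngf-config debug`.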
233+
86234
### NGINX fails to reload
87235

88236
#### Description
89237

238+
NGINX reload errors can occur for various reasons, including syntax errors in configuration files, permission issues, and more. To determine if NGINX has failed to reload, check logs for your _nginx-gateway_ and _nginx_ containers.
239+
You will see the following error in the _nginx-gateway_ logs: `failed to reload NGINX:` followed by the reason for the failure. Similarly, you will see error logs in the _nginx_ container, such as `2024/06/12 14:25:11 [emerg] 12345#0: open() "/var/run/nginx.pid" failed (13: Permission denied)`.
240+
241+
To debug why your reload has failed, start by verifying the syntax of your configuration files: open a shell in the NGINX container by following these [steps](#get-shell-access-to-containers) and run `nginx -T`. If there are errors in your configuration files, `nginx -T` reports them, which explains why the reload failed.
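
The configuration check can also be run without an interactive shell. The sketch below is one way to do it, assuming the data plane container is named `nginx` as in the commands above; the function name `ngf_nginx_config` is illustrative.

```shell
# Test and dump the NGINX configuration inside the nginx container.
# Usage: ngf_nginx_config <namespace> <ngf-pod-name>
ngf_nginx_config() {
  kubectl -n "$1" exec "$2" -c nginx -- nginx -T
}
```

Because `nginx -T` prints the full configuration, the output can be piped to `grep` or saved to a file for comparison between reloads.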
242+
243+
### Understanding the generated config
244+
245+
Understanding the NGINX configuration is key to fixing issues because it shows how NGINX handles requests. This helps you adjust settings so that NGINX behaves the way you want it to for your application. The configuration file is found at `/etc/nginx/nginx.conf` within your NGINX container. To understand the usage of NGINX directives in the configuration file, consult this list of [NGINX Directives](https://nginx.org/en/docs/dirindex.html).
246+
247+
In this section, we will see how the `nginx.conf` gets updated as we configure different services, deployments, and routes with NGINX Gateway Fabric. In the configuration file, you'll often find several `server` blocks, each assigned to specific ports and server names. NGINX selects the appropriate server for a request and evaluates the URI against the `location` directives within that block. In cases where no resources are defined, NGINX Gateway Fabric generates a basic configuration with a default server listening on port 80 for all requests and additional blocks to manage errors with status codes 500 or 502.
248+
249+
This is a default `server` block listening on port 80:
250+
251+
```text
252+
server {
253+
listen 80 default_server;
254+
255+
default_type text/html;
256+
return 404;
257+
}
258+
```
259+
260+
Once routes with path matches and rules are defined, the `nginx.conf` is updated accordingly to determine which location block will manage incoming requests. To demonstrate how `nginx.conf` is changed, let's create some resources:
261+
262+
1. A Gateway with a single listener on port 80. The hostname specified is `*.example.com`, so all incoming requests matching that wildcard are accepted by this Gateway.
263+
2. A simple `coffee` application, served under the hostname `cafe.example.com` through the Gateway created in step 1.
264+
3. An HTTPRoute to expose the `coffee` application outside the cluster using the listener created in step 1. The path and rule matches create different location blocks in `nginx.conf` to route requests as needed.
265+
266+
For example, this `coffee` route matches requests with path `/coffee` and type `PathPrefix`. Let's see how the `nginx.conf` is modified.
267+
268+
```yaml
269+
apiVersion: gateway.networking.k8s.io/v1
270+
kind: HTTPRoute
271+
metadata:
272+
name: coffee
273+
spec:
274+
parentRefs:
275+
- name: gateway
276+
sectionName: http
277+
hostnames:
278+
- "cafe.example.com"
279+
rules:
280+
- matches:
281+
- path:
282+
type: PathPrefix
283+
value: /coffee
284+
backendRefs:
285+
- name: coffee
286+
port: 80
287+
```
288+
289+
The modified `nginx.conf`:
290+
291+
```text
292+
server {
293+
listen 80 default_server;
294+
295+
default_type text/html;
296+
return 404;
297+
}
298+
299+
server {
300+
listen 80;
301+
302+
server_name cafe.example.com;
303+
304+
305+
location /coffee/ {
306+
proxy_set_header Host "$gw_api_compliant_host";
307+
proxy_set_header X-Forwarded-For "$proxy_add_x_forwarded_for";
308+
proxy_set_header Upgrade "$http_upgrade";
309+
proxy_set_header Connection "$connection_upgrade";
310+
proxy_http_version 1.1;
311+
proxy_pass http://default_coffee_80$request_uri;
312+
}
313+
314+
location = /coffee {
315+
proxy_set_header Host "$gw_api_compliant_host";
316+
proxy_set_header X-Forwarded-For "$proxy_add_x_forwarded_for";
317+
proxy_set_header Upgrade "$http_upgrade";
318+
proxy_set_header Connection "$connection_upgrade";
319+
proxy_http_version 1.1;
320+
proxy_pass http://default_coffee_80$request_uri;
321+
}
322+
323+
location / {
324+
return 404 "";
325+
}
326+
327+
}
328+
upstream default_coffee_80 {
329+
random two least_conn;
330+
zone default_coffee_80 512k;
331+
332+
server 10.244.0.13:8080;
333+
}
334+
```
335+
336+
Some key things to note here:
337+
338+
1. A new `server` block is created with the hostname of the HTTPRoute. When a request is sent to this hostname, it will be handled by this `server` block.
339+
2. Within the `server` block, three new `location` blocks are added: a prefix match for `/coffee/`, an exact match for `/coffee`, and a catch-all `/`. Requests with a path under the prefix, such as `/coffee/hello`, are handled by the first location block, while requests for the exact path `/coffee` are handled by the second. Any other request for this hostname falls through to the third location block, which returns a 404 Not Found status.
340+
3. Each *coffee* `location` block has headers and directives that configure NGINX to proxy requests for the `/coffee` path correctly, preserving important client information and ensuring compatibility with the upstream server.
341+
4. The `upstream` block in the given NGINX configuration defines a group of backend servers and configures how NGINX should load balance requests among them.
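
The location selection described in point 2 can be modeled with a small shell function. This is only an illustration of the three blocks in this example, not a general NGINX matcher, and the name `match_location` is hypothetical.

```shell
# Report which location block from the example server would handle a path.
# Mirrors only the three blocks shown above; not a general NGINX matcher.
match_location() {
  case "$1" in
    /coffee)   echo 'location = /coffee' ;;  # exact match is checked first
    /coffee/*) echo 'location /coffee/' ;;   # longest matching prefix
    *)         echo 'location /' ;;          # catch-all block, returns 404
  esac
}
```

Note that a path such as `/coffeecake` falls through to `location /`, because the prefix block requires the trailing slash in `/coffee/`.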
342+
343+
Now let's check the behaviour when a curl request is sent to the `coffee` application:
344+
345+
Matches the `location /coffee/` block:
346+
347+
```shell
348+
curl --resolve cafe.example.com:$GW_PORT:$GW_IP http://cafe.example.com:$GW_PORT/coffee/hello
349+
Handling connection for 8080
350+
Server address: 10.244.0.13:8080
351+
Server name: coffee-56b44d4c55-hwpkp
352+
Date: 13/Jun/2024:22:51:52 +0000
353+
URI: /coffee/hello
354+
Request ID: 21fc2baad77337065e7cf2cd57e04383
355+
```
356+
357+
Matches the `location = /coffee` block:
358+
359+
```shell
360+
curl --resolve cafe.example.com:$GW_PORT:$GW_IP http://cafe.example.com:$GW_PORT/coffee
361+
Handling connection for 8080
362+
Server address: 10.244.0.13:8080
363+
Server name: coffee-56b44d4c55-hwpkp
364+
Date: 13/Jun/2024:22:51:40 +0000
365+
URI: /coffee
366+
Request ID: 4d8d719e95063303e290ad74ecd7339f
367+
```
368+
369+
Matches the `location /` block:
370+
371+
```shell
372+
curl --resolve cafe.example.com:$GW_PORT:$GW_IP http://cafe.example.com:$GW_PORT/
373+
Handling connection for 8080
374+
<html>
375+
<head><title>404 Not Found</title></head>
376+
<body>
377+
<center><h1>404 Not Found</h1></center>
378+
<hr><center>nginx/1.25.4</center>
379+
</body>
380+
```
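
To repeat these three checks quickly, the requests can be wrapped in a loop that prints only the HTTP status code. This is a sketch: the function name `probe_cafe` is an assumption, and the Gateway IP and port are supplied by the caller (the `GW_IP` and `GW_PORT` values used above).

```shell
# Print the HTTP status code returned for each example path.
# Usage: probe_cafe <gateway-ip> <gateway-port>
probe_cafe() {
  for path in /coffee/hello /coffee /; do
    # -o /dev/null discards the body; -w prints only the status code.
    code=$(curl -s -o /dev/null -w '%{http_code}' \
      --resolve "cafe.example.com:$2:$1" "http://cafe.example.com:$2$path")
    printf '%s %s\n' "$path" "$code"
  done
}
```

Based on the responses shown above, `probe_cafe "$GW_IP" "$GW_PORT"` should report a successful status for the two `/coffee` paths and 404 for `/`.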
381+
382+
#### Metrics for Troubleshooting
383+
384+
Metrics can be useful to identify performance bottlenecks and pinpoint areas of high resource consumption within NGINX Gateway Fabric. To set up metrics collection, refer to this [guide]({{< relref "prometheus.md" >}}). The metrics dashboard will help you understand problems with the way NGINX Gateway Fabric is set up or potential issues that could show up over time.
385+
386+
For example, metrics `nginx_reloads_total` and `nginx_reload_errors_total` offer valuable insights into the system's stability and reliability. A high `nginx_reloads_total` value indicates frequent updates or configuration changes, while a high `nginx_reload_errors_total` value suggests issues with the configuration or other problems preventing successful reloads. Monitoring these metrics helps identify and resolve configuration errors, ensuring consistent service reliability.
387+
388+
In such situations, review the logs of both the _nginx_ and _nginx-gateway_ containers for error messages. Additionally, verify that the configured resources are in a valid state.
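
Without a dashboard, the two reload counters can be read directly from the metrics endpoint output. The sketch below sums them from Prometheus text format on standard input. The function name `reload_stats` is hypothetical, and it matches on the metric name suffix, so it assumes the exported names end in `nginx_reloads_total` and `nginx_reload_errors_total` and that each sample line ends with its value (no trailing timestamp).

```shell
# Sum the reload and reload-error counters from Prometheus text format.
# Reads metrics from stdin and prints "reloads=<n> errors=<n>".
reload_stats() {
  awk '
    /^#/ { next }                          # skip HELP and TYPE comments
    {
      name = $1; sub(/[{].*/, "", name)    # strip the label set, if any
      if (name ~ /nginx_reloads_total$/)       reloads += $NF
      if (name ~ /nginx_reload_errors_total$/) errors  += $NF
    }
    END { printf "reloads=%d errors=%d\n", reloads, errors }
  '
}
```

Assuming the metrics endpoint has been made reachable locally (for example with `kubectl port-forward` to the metrics port; 9113 is listed in the Pod description earlier in this guide), usage would look like `curl -s http://localhost:9113/metrics | reload_stats`.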
389+
390+
### Common Errors
391+
392+
##### Insufficient Privileges errors
393+
90394
Depending on your environment's configuration, the control plane may not have the proper permissions to reload NGINX. The NGINX configuration will not be applied and you will see the following error in the _nginx-gateway_ logs:
91395

92396
`failed to reload NGINX: failed to send the HUP signal to NGINX main: operation not permitted`
93397

94-
#### Resolution
95-
96-
To resolve this issue you will need to set `allowPrivilegeEscalation` to `true`.
398+
To **resolve** this issue you will need to set `allowPrivilegeEscalation` to `true`.
97399

98400
- If using Helm, you can set the `nginxGateway.securityContext.allowPrivilegeEscalation` value.
99401
- If using the manifests directly, you can update this field under the `nginx-gateway` container's `securityContext`.
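
After redeploying, the effective setting can be read back from the running Pod. This is a minimal sketch with a hypothetical function name; it assumes the control plane container is named `nginx-gateway`, as in the commands earlier in this guide.

```shell
# Print the allowPrivilegeEscalation setting of the nginx-gateway container.
# Usage: ngf_priv_escalation <namespace> <ngf-pod-name>
ngf_priv_escalation() {
  kubectl -n "$1" get pod "$2" \
    -o jsonpath='{.spec.containers[?(@.name=="nginx-gateway")].securityContext.allowPrivilegeEscalation}'
}
```

An empty result means the field is not set on the container.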
100402

101-
### Usage Reporting errors
102-
103-
#### Description
403+
##### Usage Reporting errors
104404

105405
If using NGINX Gateway Fabric with NGINX Plus as the data plane, you will see the following error in the _nginx-gateway_ logs if you have not enabled Usage Reporting:
106406

107407
`usage reporting not enabled`
108408

109-
#### Resolution
110-
111-
To resolve this issue, enable Usage Reporting by following the [Usage Reporting]({{< relref "installation/usage-reporting.md" >}}) guide.
409+
To **resolve** this issue, enable Usage Reporting by following the [Usage Reporting]({{< relref "installation/usage-reporting.md" >}}) guide.
112410

113-
### 413 Request Entity Too Large
114-
115-
#### Description
411+
##### 413 Request Entity Too Large
116412

117413
If you receive the following error:
118414

@@ -133,7 +429,9 @@ Or view the following error message in the NGINX logs:
133429
```
134430

135431
The request body exceeds the [client_max_body_size](https://nginx.org/en/docs/http/ngx_http_core_module.html#client_max_body_size).
432+
To **resolve** this, you can configure the `client_max_body_size` using the `ClientSettingsPolicy` API. Read the [Client Settings Policy]({{< relref "how-to/traffic-management/client-settings.md" >}}) documentation for more information.
433+
136434

137-
#### Resolution
435+
### Further Reading
138436

139-
You can configure the `client_max_body_size` using the `ClientSettingsPolicy` API. Read the [Client Settings Policy]({{< relref "how-to/traffic-management/client-settings.md" >}}) documentation for more information.
437+
You can check out the [Kubernetes Troubleshooting Guide](https://kubernetes.io/docs/tasks/debug/debug-application/) for further assistance.
