**Describe the bug**
When NGF fails to collect product telemetry, it sends empty telemetry data.
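The empty report looks like the zero value of the telemetry data struct being exported even though collection returned an error. A minimal sketch of that failure pattern (the `Data` struct and `collect` function here are hypothetical stand-ins, not the actual NGF types):

```go
package main

import (
	"errors"
	"fmt"
)

// Data loosely mirrors the telemetry payload; field names are illustrative.
type Data struct {
	ProjectName      string
	ClusterNodeCount int
}

// collect simulates a collection failure, such as the forbidden NodeList call.
func collect() (Data, error) {
	return Data{}, errors.New("failed to get NodeList: nodes is forbidden")
}

func main() {
	data, err := collect()
	if err != nil {
		fmt.Println("collection failed:", err)
	}
	// Bug pattern: exporting unconditionally sends the zero-valued struct.
	fmt.Printf("exported: %+v\n", data) // all fields empty/zero
}
```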
**To Reproduce**

Create a kind cluster, build the NGF images, and load them into the cluster:

```shell
cd tests
make create-kind-cluster
make build-images load-images TAG=$(whoami) TELEMETRY_ENDPOINT=otel-collector-opentelemetry-collector.collector.svc.cluster.local:4317 TELEMETRY_ENDPOINT_INSECURE=true
```

Deploy the OTel collector:

```shell
helm install otel-collector open-telemetry/opentelemetry-collector -f suite/manifests/telemetry/collector-values.yaml -n collector --create-namespace
```

Deploy NGF:

```shell
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.0.0/standard-install.yaml
cd ..
helm install my-release ./deploy/helm-chart --create-namespace --wait \
  --set service.type=NodePort \
  --set nginxGateway.image.repository=nginx-gateway-fabric \
  --set nginxGateway.image.tag=$(whoami) \
  --set nginxGateway.image.pullPolicy=Never \
  --set nginx.image.repository=nginx-gateway-fabric/nginx \
  --set nginx.image.tag=$(whoami) \
  --set nginx.image.pullPolicy=Never \
  -n nginx-gateway
```

Edit the NGF ClusterRole to remove the RBAC rule that permits listing nodes:

```shell
kubectl edit clusterrole my-release-nginx-gateway-fabric
```

Remove:

```yaml
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
```

Delete the NGF pod so that a new one is created:

```shell
kubectl -n nginx-gateway delete pod <pod-name>
```

Check the NGF pod logs; telemetry collection should fail because of the RBAC change:
```text
{"level":"error","ts":"2024-03-19T21:21:12Z","logger":"telemetryJob","msg":"Failed to collect telemetry data","error":"failed to collect cluster information: failed to get NodeList: nodes is forbidden: User \"system:serviceaccount:nginx-gateway:my-release-nginx-gateway-fabric\" cannot list resource \"nodes\" in API group \"\" at the cluster scope","stacktrace":"github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.createTelemetryJob.CreateTelemetryJobWorker.func4\n\tgithub.com/nginxinc/nginx-gateway-fabric/internal/mode/static/telemetry/job_worker.go:29\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\tk8s.io/[email protected]/pkg/util/wait/backoff.go:259\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\tk8s.io/[email protected]/pkg/util/wait/backoff.go:226\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\tk8s.io/[email protected]/pkg/util/wait/backoff.go:227\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tk8s.io/[email protected]/pkg/util/wait/backoff.go:204\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\tk8s.io/[email protected]/pkg/util/wait/backoff.go:259\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/runnables.(*CronJob).Start\n\tgithub.com/nginxinc/nginx-gateway-fabric/internal/framework/runnables/cronjob.go:53\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\tsigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:223"}
```
Check the collector logs:

```shell
kubectl -n collector logs <otel-collector-pod-name> | grep "dataType: Str(ngf-product-telemetry)" -A 19
```
```text
-> dataType: Str(ngf-product-telemetry)
-> ImageSource: Str(local)
-> ProjectName: Str(NGF)
-> ProjectVersion: Str(edge)
-> ProjectArchitecture: Str(amd64)
-> ClusterID: Str(ced72774-ef05-403c-9a91-2acffc9c386f)
-> ClusterVersion: Str(1.29.2)
-> ClusterPlatform: Str(kind)
-> InstallationID: Str(43a0a1be-919c-417b-b85e-782adb1e3f39)
-> ClusterNodeCount: Int(1)
-> FlagNames: Slice(["config","gateway","gateway-api-experimental-features","gateway-ctlr-name","gatewayclass","health-disable","health-port","help","leader-election-disable","leader-election-lock-name","metrics-disable","metrics-port","metrics-secure-serving","nginx-plus","product-telemetry-disable","service","update-gatewayclass-status","usage-report-cluster-name","usage-report-secret","usage-report-server-url","usage-report-skip-verify"])
-> FlagValues: Slice(["user-defined","default","false","user-defined","user-defined","false","default","false","false","user-defined","false","default","false","false","false","user-defined","true","default","default","default","false"])
-> GatewayCount: Int(0)
-> GatewayClassCount: Int(1)
-> HTTPRouteCount: Int(0)
-> SecretCount: Int(0)
-> ServiceCount: Int(0)
-> EndpointCount: Int(0)
-> NGFReplicaCount: Int(1)
{"kind": "exporter", "data_type": "traces", "name": "debug"}
--
-> dataType: Str(ngf-product-telemetry)
-> ImageSource: Str()
-> ProjectName: Str()
-> ProjectVersion: Str()
-> ProjectArchitecture: Str()
-> ClusterID: Str()
-> ClusterVersion: Str()
-> ClusterPlatform: Str()
-> InstallationID: Str()
-> ClusterNodeCount: Int(0)
-> FlagNames: Slice([])
-> FlagValues: Slice([])
-> GatewayCount: Int(0)
-> GatewayClassCount: Int(0)
-> HTTPRouteCount: Int(0)
-> SecretCount: Int(0)
-> ServiceCount: Int(0)
-> EndpointCount: Int(0)
-> NGFReplicaCount: Int(0)
{"kind": "exporter", "data_type": "traces", "name": "debug"}
```
Note how the second report, sent by the new pod, contains only empty data.
**Expected behavior**

If telemetry collection fails, NGF should not send any telemetry data.
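A possible shape of the fix is for the job worker to return early on a collection error instead of passing the zero-valued data on to the exporter. A minimal sketch, assuming hypothetical `collectTelemetry`/`exportTelemetry` helpers rather than the actual NGF functions:

```go
package main

import (
	"errors"
	"fmt"
)

// Data is an illustrative stand-in for the telemetry payload.
type Data struct{ ProjectName string }

// collectTelemetry simulates a failing collection (e.g. forbidden NodeList).
func collectTelemetry() (Data, error) {
	return Data{}, errors.New("nodes is forbidden")
}

// exported records what was sent; a stand-in for the OTel exporter.
var exported []Data

func exportTelemetry(d Data) { exported = append(exported, d) }

// runJob sketches the fixed worker: on a collection error it logs and
// returns without exporting, rather than exporting the zero-valued Data.
func runJob() {
	d, err := collectTelemetry()
	if err != nil {
		fmt.Println("failed to collect telemetry data:", err)
		return // skip the export; do not send empty data
	}
	exportTelemetry(d)
}

func main() {
	runJob()
	fmt.Println("reports sent:", len(exported)) // 0
}
```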
**Your environment**

- NGF edge version