Summary
Expose more of the client-side metrics offered by client-go in the controller process by default, similar to what the Kubernetes built-in controllers and the apiserver do.
Time and time again, the lack of these metrics in our internal controllers has prevented us from monitoring how long requests are stuck in the client-side rate limiter, or what latency the controller's REST client observes on its requests (short of writing our own instrumented REST transport wrapper).
Details
client-go currently exposes hooks that a metrics collector can register with (https://github.com/kubernetes/client-go/blob/v0.33.0/tools/metrics/metrics.go#L114-L127). They back the following metrics:
| Metric Name | Type | Dimensions | Description |
|---|---|---|---|
| `rest_client_request_duration_seconds` | Histogram | `verb`, `host` | Request latency in seconds. Buckets: [0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0, 60.0] |
| `rest_client_dns_resolution_duration_seconds` | Histogram | `host` | DNS resolver latency in seconds. Buckets: [0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0] |
| `rest_client_request_size_bytes` | Histogram | `verb`, `host` | Request size in bytes. Buckets: [64, 256, 512, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216] |
| `rest_client_response_size_bytes` | Histogram | `verb`, `host` | Response size in bytes. Buckets: [64, 256, 512, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216] |
| `rest_client_rate_limiter_duration_seconds` | Histogram | `verb`, `host` | Client-side rate limiter latency in seconds. Buckets: [0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0, 60.0] |
| `rest_client_requests_total` | Counter | `code`, `method`, `host` | Number of HTTP requests. |
| `rest_client_request_retries_total` | Counter | `code`, `verb`, `host` | Number of request retries. |
| `rest_client_transport_cache_entries` | Gauge | (none) | Number of transport entries in the internal cache. |
| `rest_client_transport_create_calls_total` | Counter | `result` | Number of calls to get a new transport, partitioned by the result of the operation. |
Among these, the only metric currently exposed with controller-runtime is `rest_client_requests_total`. Some other metrics were previously removed (#1587) due to unbounded dimension cardinality; however, with recent overhauls to the metrics, the highest cardinality we get is the `host` dimension (which is presumably just however many apiserver `host:port`s you have).
Proposal
- controller-runtime starts exposing all of the listed metrics by default (by copying them from k8s.io/component-base).
- The existing `rest_client_requests_total` metric should remain unmodified.
- The `ExecPluginCalls` hook (i.e. the `rest_client_exec_plugin_call_total` metric) should be left out, as it is very rarely, if ever, useful for a controller process.
Considerations
- Stability: ALL of the metrics listed above are listed in `ALPHA` stage in component-base and in the k8s.io Metrics Documentation, presumably for components like `kube-scheduler`, `kube-controller-manager` etc. Do we also offer them as stable? Or do we break users later?
- Cardinality: Some histogram metrics have 10-12 buckets. In a large cluster setup with 10 apiservers x 4 verbs, it can easily reach 400+ time series per metric (still bounded though).
- Future improvements: client-go offers a `url` value in one of the hook functions. This `url` is actually a value that's free of resource {namespace,name} (i.e. it's bounded cardinality for us!) but is available only in one metric hook 😢. component-base basically uses that `url.URL` value to find the `host` label.

  However, if client-go some day starts providing a `url` label for every metric, it would be even more useful, but we'd likely need to break the metrics.
/kind design
/cc @alvaroaleman