Description
Metrics are created for the controller's workqueue, using Prometheus's default buckets. Unfortunately, the default buckets are poorly chosen for event processing.
The default buckets are tailored to broadly measure the response time (in seconds) of a network service.
This can easily result in metrics that are extremely coarse. In a controller I was working on today, every single reconcile was faster than the smallest bucket (5 ms), so all 31 observations land in every bucket:
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.005"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.01"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.025"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.05"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.1"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.25"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="0.5"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="1"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="2.5"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="5"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="10"} 31
controller_runtime_reconcile_time_seconds_bucket{controller="application",le="+Inf"} 31
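To make the coarseness concrete, here is a minimal, standard-library-only sketch that counts a few made-up reconcile durations against the default bucket bounds and against a finer exponential layout. The sample durations and the helper names (`exponentialBuckets`, `cumulativeCounts`) are assumptions for illustration only; the bucket bounds are copied from Prometheus's `DefBuckets`.

```go
package main

import (
	"fmt"
	"sort"
)

// defaultBuckets mirrors the upper bounds (in seconds) of Prometheus's
// default histogram buckets, prometheus.DefBuckets.
var defaultBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// exponentialBuckets is a hypothetical helper that returns count upper
// bounds, starting at start and multiplying by factor each step.
func exponentialBuckets(start, factor float64, count int) []float64 {
	bounds := make([]float64, count)
	for i := range bounds {
		bounds[i] = start
		start *= factor
	}
	return bounds
}

// cumulativeCounts returns, for each upper bound, the number of samples
// less than or equal to it, like the le="..." series of a histogram.
// bounds must be sorted in ascending order.
func cumulativeCounts(bounds, samples []float64) []int {
	counts := make([]int, len(bounds))
	for _, s := range samples {
		// Index of the first bucket whose upper bound is >= s.
		first := sort.SearchFloat64s(bounds, s)
		for i := first; i < len(bounds); i++ {
			counts[i]++
		}
	}
	return counts
}

func main() {
	// Made-up reconcile durations in seconds, all well under 5 ms.
	samples := []float64{0.0002, 0.0008, 0.003}

	fmt.Println("default:", cumulativeCounts(defaultBuckets, samples))
	fmt.Println("finer:  ", cumulativeCounts(exponentialBuckets(0.0001, 4, 6), samples))
}
```

With the default bounds every bucket reports the same count, which is the situation shown in the output above. With the finer bounds the three samples fall into different buckets, so the distribution becomes visible.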
It is possible to change the default buckets by modifying DefBuckets in an init function, as my init will be called after the DefBuckets variable has been initialized but before the controllermetrics package's init. But this is a very heavy-handed approach, as it changes the defaults for all histograms.
I propose two paths forward:
- Allow the user to pass their own metrics to the controller, with the current collectors used if none are provided.
- Alternatively, move the current package out of the internal package. This would allow users to call Unregister and then assign a replacement.
My preference would be the first option, as the change to the system would be more easily understood.
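A rough sketch of the shape the first option could take, using only the standard library. Every name here (`ReconcileObserver`, `Options.Metrics`, `NewController`, `recordReconcile`, and the toy `histogram`) is hypothetical and is not the existing controller-runtime API; the point is only the "use what the caller passes, otherwise fall back to the current defaults" behaviour.

```go
package main

import "fmt"

// ReconcileObserver is a hypothetical interface for recording reconcile
// durations. It is an assumption, not an existing controller-runtime type.
type ReconcileObserver interface {
	ObserveReconcile(controller string, seconds float64)
}

// histogram is a toy stand-in for a Prometheus histogram.
// counts[i] holds the number of observations <= bounds[i].
type histogram struct {
	bounds []float64
	counts []int
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]int, len(bounds))}
}

// ObserveReconcile increments every bucket whose upper bound covers seconds.
func (h *histogram) ObserveReconcile(_ string, seconds float64) {
	for i, b := range h.bounds {
		if seconds <= b {
			h.counts[i]++
		}
	}
}

// defaultBounds mirrors Prometheus's default bucket bounds, in seconds.
var defaultBounds = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// Options is a hypothetical options struct. Metrics is optional.
type Options struct {
	Metrics ReconcileObserver
}

// Controller is a minimal placeholder for a controller.
type Controller struct {
	name    string
	metrics ReconcileObserver
}

// NewController uses the caller's metrics when provided and falls back to
// a default-bucket histogram otherwise.
func NewController(name string, opts Options) *Controller {
	m := opts.Metrics
	if m == nil {
		m = newHistogram(defaultBounds)
	}
	return &Controller{name: name, metrics: m}
}

// recordReconcile reports one reconcile duration to the configured metrics.
func (c *Controller) recordReconcile(seconds float64) {
	c.metrics.ObserveReconcile(c.name, seconds)
}

func main() {
	// Caller-supplied buckets suited to sub-millisecond reconciles.
	custom := newHistogram([]float64{0.0005, 0.005})
	c := NewController("application", Options{Metrics: custom})
	c.recordReconcile(0.001)
	fmt.Println(custom.counts)
}
```

Existing users who pass nothing would see no change in behaviour, which is why this option seems easier to reason about than exposing the package and relying on Unregister.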