Description
Using the Prometheus library to collect metrics mostly works fine, but it has some limitations: #258 wants to change the way metrics are aggregated, and #297 wants to add additional handlers to the manager's HTTP endpoint.
Maybe this is a far-out idea, but I wonder if switching from the Prometheus client to OpenCensus for measurement at this early stage would be a good idea. TL;DR: OpenCensus is a collection of libraries in multiple languages that facilitates the measurement and aggregation of metrics in-process and is agnostic to the export format used. It doesn't replace Prometheus the service, it just replaces Prometheus the Go library. OpenCensus can export to Prometheus servers, so this is strictly an in-process change.
The OpenCensus Go library is similar to the Prometheus client, but separates the collection of metrics from their aggregation and export. This theoretically allows libraries to be instrumented without dictating how users will aggregate metrics (solving #258) and export metrics (solving #297), though default solutions can be provided for both (likely the same as today's default bucketing and Prometheus HTTP exporter).
Here's an example from knative/pkg of defining measures and views (aggregations): https://github.com/knative/pkg/blob/53b1235c2a85e1309825bc467b3bd54243c879e6/controller/stats_reporter.go. The view is defined separately from the measure, which lets the library user define their own views over library-defined measures.
And here's an example of exporting metrics to either Stackdriver or Prometheus: https://github.com/knative/pkg/blob/225d11cc1a40c0549701fb037d0eba48ee87dfe4/metrics/exporter.go. The user of the library can export views in whatever format they wish, independent of the measures and views that are defined.
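To make the separation concrete, here is a minimal stdlib-only sketch of the idea. It is a toy model, not the OpenCensus API: `Measure`, `View`, `Registry`, `Exporter`, `MapExporter`, and the metric names are all hypothetical names chosen for illustration. The library records raw values against a measure, the user decides which views (aggregations) to register, and the exporter is chosen independently of both.

```go
package main

import "fmt"

// Measure is a hypothetical stand-in for a library-defined measurement.
type Measure struct{ Name string }

// View describes how the values recorded for one Measure are aggregated.
// Users can register their own views over measures the library defines.
type View struct {
	Name      string
	Measure   *Measure
	Aggregate func(values []float64) float64
}

// Exporter receives aggregated values; the output format is its concern.
type Exporter interface {
	Export(viewName string, value float64)
}

// Registry stores raw recordings and registered views.
// This toy version is not safe for concurrent use.
type Registry struct {
	values map[*Measure][]float64
	views  []View
}

func NewRegistry() *Registry {
	return &Registry{values: make(map[*Measure][]float64)}
}

// Record stores a raw value without knowing how it will be aggregated.
func (r *Registry) Record(m *Measure, v float64) {
	r.values[m] = append(r.values[m], v)
}

// Register adds a view; this is the user's choice, not the library's.
func (r *Registry) Register(v View) { r.views = append(r.views, v) }

// Flush aggregates each view and hands the result to the exporter.
func (r *Registry) Flush(e Exporter) {
	for _, v := range r.views {
		e.Export(v.Name, v.Aggregate(r.values[v.Measure]))
	}
}

// Count and Sum are two example aggregations.
func Count(values []float64) float64 { return float64(len(values)) }

func Sum(values []float64) float64 {
	total := 0.0
	for _, v := range values {
		total += v
	}
	return total
}

// MapExporter is a trivial exporter that collects results in a map.
type MapExporter map[string]float64

func (m MapExporter) Export(viewName string, value float64) { m[viewName] = value }

func main() {
	latency := &Measure{Name: "reconcile_latency_ms"} // defined by the library

	reg := NewRegistry()
	// Two different views over the same measure, chosen by the user.
	reg.Register(View{Name: "reconcile_count", Measure: latency, Aggregate: Count})
	reg.Register(View{Name: "reconcile_latency_sum_ms", Measure: latency, Aggregate: Sum})

	reg.Record(latency, 12)
	reg.Record(latency, 30)

	out := MapExporter{}
	reg.Flush(out)
	fmt.Println(out)
}
```

In the real library, the corresponding pieces would be measures, registered views, and a registered exporter; swapping `MapExporter` for another `Exporter` here changes the output format without touching the recording code.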
OpenCensus additionally supports exporting traces, which IMO would be a useful debugging tool and a good use for the context arguments in the client interface (mentioned in #265). Threading the trace ID into that context would give the controller author a nice overview of the entire reconcile, with spans for each request, cached or not.