Commit b5f3440

Auto merge of #3531 - pietroalbini:metrics, r=jtgeibel
Add support for Prometheus metrics

This PR adds support for collecting Prometheus metrics in crates.io, and adds some basic metrics to make sure everything works correctly. We'll want to add way more metrics after the PR is merged.

Prometheus is the monitoring service used for most of the Rust infrastructure, and the infra team maintains an instance of it. Prometheus periodically scrapes an endpoint of the application and ingests all the metrics contained in it. It then provides a query interface to visualize metrics over time, and supports generating alerts based on the current metrics. An example query to detect when more than 100 crate versions are published in an hour would be:

```
increase(cratesio_service_versions_total[1h]) > 100
```

## Service-level metrics

Service-level metrics are available at `/api/private/metrics/service` and include everything related to the service as a whole, rather than to a specific crates.io backend instance. Examples of these metrics are the number of crates ever published, or how many jobs are present in the background queue.

These metrics will be scraped by hitting `https://crates.io`, so it's likely that two requests will be served by different backend servers due to the load balancer. This means service-level metrics must never include any information specific to a single instance.

Current output of that endpoint:

```bash
# HELP cratesio_service_crates_total Number of crates ever published
# TYPE cratesio_service_crates_total gauge
cratesio_service_crates_total 2
# HELP cratesio_service_versions_total Number of versions ever published
# TYPE cratesio_service_versions_total gauge
cratesio_service_versions_total 4
```

## Instance-level metrics

> **Note:** we can't collect instance-level metrics right now due to the Heroku load balancer. The infra team is looking into implementing a workaround, see [this document](https://paper.dropbox.com/doc/crates.io-monitoring--BJC8_vas_Jkav2St9QrFf__aAg-JWf5AxfJ1Nbc3lLuNcaTy).

Instance-level metrics are available at `/api/private/metrics/instance` and include everything specific to a single crates.io backend instance. Examples of these metrics are the state of the database pool, how many requests were processed, what the response time was, or how many downloads are not counted yet.

These metrics will be scraped by hitting each individual backend at the same time, and Prometheus will then aggregate the results to offer a complete picture in the dashboard. Metrics that represent the state of the whole system shouldn't be implemented as instance-level metrics, as otherwise the resulting metric will be aggregated from each backend.

Current output of that endpoint:

```bash
# HELP cratesio_instance_database_idle_conns Number of idle database connections in the pool
# TYPE cratesio_instance_database_idle_conns gauge
cratesio_instance_database_idle_conns{pool="follower"} 3
cratesio_instance_database_idle_conns{pool="primary"} 3
# HELP cratesio_instance_database_used_conns Number of used database connections in the pool
# TYPE cratesio_instance_database_used_conns gauge
cratesio_instance_database_used_conns{pool="follower"} 0
cratesio_instance_database_used_conns{pool="primary"} 0
# HELP cratesio_instance_requests_in_flight Number of requests currently being processed
# TYPE cratesio_instance_requests_in_flight gauge
cratesio_instance_requests_in_flight 1
# HELP cratesio_instance_requests_total Number of requests processed by this instance
# TYPE cratesio_instance_requests_total counter
cratesio_instance_requests_total 8
```

## Authentication

To prevent third parties from scraping our metrics (which in the future could contain sensitive data), all the metrics endpoints are protected with HTTP authentication. Requests without an `Authorization` header matching the contents of the `METRICS_AUTHORIZATION_TOKEN` environment variable will be rejected. If the environment variable is missing, metrics collection will be disabled, to prevent accidental leaks.

r? `@jtgeibel`
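To illustrate why whole-system metrics should not be exposed as instance-level metrics, here is a minimal sketch. It assumes the per-backend values are combined with a sum, which is only one of the aggregations Prometheus offers. The function names and sample values are hypothetical and are not part of this PR.

```rust
/// Hypothetical helper: adds up the value each backend reported for one
/// metric, roughly what a `sum(...)` aggregation over instances would do.
pub fn sum_across_backends(per_backend: &[i64]) -> i64 {
    per_backend.iter().sum()
}

/// Requests handled by three imaginary backends. Each backend only knows
/// about its own requests, so summing gives the real total.
pub fn total_requests_example() -> i64 {
    sum_across_backends(&[8, 5, 3])
}

/// If every backend also reported the same database-wide crate count (2),
/// summing would triple it. Values like this belong in service-level metrics.
pub fn inflated_crates_example() -> i64 {
    sum_across_backends(&[2, 2, 2])
}
```

Under that assumption the request counts add up correctly, while the crate count is overstated by a factor equal to the number of backends.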
2 parents 11707b6 + a6f52f8 commit b5f3440

20 files changed: 429 additions and 6 deletions.


Cargo.lock

Lines changed: 22 additions & 0 deletions

Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -68,6 +68,7 @@ lettre = { version = "0.10.0-beta.3", default-features = false, features = ["fil
 license-exprs = "1.6"
 oauth2 = { version = "4.0.0-beta.1", default-features = false, features = ["reqwest"] }
 parking_lot = "0.11"
+prometheus = "0.12.0"
 rand = "0.8"
 reqwest = { version = "0.11", features = ["blocking", "gzip", "json"] }
 scheduled-thread-pool = "0.2.0"

src/app.rs

Lines changed: 10 additions & 0 deletions
@@ -6,6 +6,7 @@ use std::{sync::Arc, time::Duration};
 use crate::downloads_counter::DownloadsCounter;
 use crate::email::Emails;
 use crate::github::GitHubClient;
+use crate::metrics::{InstanceMetrics, ServiceMetrics};
 use diesel::r2d2;
 use oauth2::basic::BasicClient;
 use reqwest::blocking::Client;
@@ -40,6 +41,12 @@ pub struct App {
     /// Backend used to send emails
     pub emails: Emails,
 
+    /// Metrics related to the service as a whole
+    pub service_metrics: ServiceMetrics,
+
+    /// Metrics related to this specific instance of the service
+    pub instance_metrics: InstanceMetrics,
+
     /// A configured client for outgoing HTTP requests
     ///
     /// In production this shares a single connection pool across requests. In tests
@@ -141,6 +148,9 @@ impl App {
             config,
             downloads_counter: DownloadsCounter::new(),
             emails: Emails::from_environment(),
+            service_metrics: ServiceMetrics::new().expect("could not initialize service metrics"),
+            instance_metrics: InstanceMetrics::new()
+                .expect("could not initialize instance metrics"),
             http_client,
         }
     }

src/config.rs

Lines changed: 5 additions & 1 deletion
@@ -21,6 +21,7 @@ pub struct Config {
     pub allowed_origins: Vec<String>,
     pub downloads_persist_interval_ms: usize,
     pub ownership_invitations_expiration_days: u64,
+    pub metrics_authorization_token: Option<String>,
 }
 
 impl Default for Config {
@@ -47,8 +48,10 @@ impl Default for Config {
     /// - `DATABASE_URL`: The URL of the postgres database to use.
     /// - `READ_ONLY_REPLICA_URL`: The URL of an optional postgres read-only replica database.
     /// - `BLOCKED_TRAFFIC`: A list of headers and environment variables to use for blocking
-    ///. traffic. See the `block_traffic` module for more documentation.
+    /// traffic. See the `block_traffic` module for more documentation.
     /// - `DOWNLOADS_PERSIST_INTERVAL_MS`: how frequent to persist download counts (in ms).
+    /// - `METRICS_AUTHORIZATION_TOKEN`: authorization token needed to query metrics. If missing,
+    /// querying metrics will be completely disabled.
     fn default() -> Config {
         let api_protocol = String::from("https");
         let mirror = if dotenv::var("MIRROR").is_ok() {
@@ -156,6 +159,7 @@ impl Default for Config {
                 })
                 .unwrap_or(60_000), // 1 minute
             ownership_invitations_expiration_days: 30,
+            metrics_authorization_token: dotenv::var("METRICS_AUTHORIZATION_TOKEN").ok(),
         }
     }
 }

src/controllers.rs

Lines changed: 1 addition & 0 deletions
@@ -101,6 +101,7 @@ pub mod category;
 pub mod crate_owner_invitation;
 pub mod keyword;
 pub mod krate;
+pub mod metrics;
 pub mod site_metadata;
 pub mod team;
 pub mod token;

src/controllers/metrics.rs

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+use crate::controllers::frontend_prelude::*;
+use crate::util::errors::{forbidden, not_found, MetricsDisabled};
+use conduit::{Body, Response};
+use prometheus::{Encoder, TextEncoder};
+
+/// Handles the `GET /api/private/metrics/:kind` endpoint.
+pub fn prometheus(req: &mut dyn RequestExt) -> EndpointResult {
+    let app = req.app();
+
+    if let Some(expected_token) = &app.config.metrics_authorization_token {
+        let provided_token = req
+            .headers()
+            .get(header::AUTHORIZATION)
+            .and_then(|value| value.to_str().ok())
+            .and_then(|value| value.strip_prefix("Bearer "));
+
+        if provided_token != Some(expected_token.as_str()) {
+            return Err(forbidden());
+        }
+    } else {
+        // To avoid accidentally leaking metrics if the environment variable is not set, prevent
+        // access to any metrics endpoint if the authorization token is not configured.
+        return Err(Box::new(MetricsDisabled));
+    }
+
+    let metrics = match req.params()["kind"].as_str() {
+        "service" => app.service_metrics.gather(&*req.db_read_only()?)?,
+        "instance" => app.instance_metrics.gather(app)?,
+        _ => return Err(not_found()),
+    };
+
+    let mut output = Vec::new();
+    TextEncoder::new().encode(&metrics, &mut output)?;
+
+    Ok(Response::builder()
+        .header(header::CONTENT_TYPE, "text/plain; charset=utf-8")
+        .header(header::CONTENT_LENGTH, output.len())
+        .body(Body::from_vec(output))?)
+}
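
The three outcomes of the authorization check in the controller can be isolated into a small pure function. The following is a standalone sketch using only the standard library. `MetricsAccess` and `check_metrics_access` are hypothetical names that do not exist in the PR, and the real controller returns `forbidden()` and `MetricsDisabled` errors instead of an enum.

```rust
/// Hypothetical result type standing in for the controller's error values.
#[derive(Debug, PartialEq, Eq)]
pub enum MetricsAccess {
    /// The header carried the expected bearer token.
    Allowed,
    /// A token is configured but the request did not present it.
    Forbidden,
    /// No token is configured, so every request is refused.
    Disabled,
}

/// Mirrors the decision made in the controller: the `Authorization` header
/// must have the form `Bearer <token>` and match the configured token exactly.
pub fn check_metrics_access(
    expected_token: Option<&str>,
    authorization_header: Option<&str>,
) -> MetricsAccess {
    let expected = match expected_token {
        Some(token) => token,
        // Missing configuration disables the endpoint instead of opening it.
        None => return MetricsAccess::Disabled,
    };

    let provided = authorization_header.and_then(|value| value.strip_prefix("Bearer "));

    if provided == Some(expected) {
        MetricsAccess::Allowed
    } else {
        MetricsAccess::Forbidden
    }
}
```

Note that a header containing only the raw token, without the `Bearer ` prefix, is rejected just like a wrong token.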

src/db.rs

Lines changed: 18 additions & 3 deletions
@@ -24,10 +24,19 @@ impl DieselPool {
         }
     }
 
-    pub fn state(&self) -> r2d2::State {
+    pub fn state(&self) -> PoolState {
         match self {
-            DieselPool::Pool(pool) => pool.state(),
-            DieselPool::Test(_) => panic!("Cannot get the state of a test pool"),
+            DieselPool::Pool(pool) => {
+                let state = pool.state();
+                PoolState {
+                    connections: state.connections,
+                    idle_connections: state.idle_connections,
+                }
+            }
+            DieselPool::Test(_) => PoolState {
+                connections: 0,
+                idle_connections: 0,
+            },
         }
     }
 
@@ -36,6 +45,12 @@ impl DieselPool {
     }
 }
 
+#[derive(Debug, Copy, Clone)]
+pub struct PoolState {
+    pub connections: u32,
+    pub idle_connections: u32,
+}
+
 #[allow(missing_debug_implementations)]
 pub enum DieselPooledConn<'a> {
     Pool(r2d2::PooledConnection<ConnectionManager<PgConnection>>),
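
The instance metrics derive the number of used connections from `PoolState` as `connections - idle_connections`. As a sketch of that calculation, the struct is redeclared below so the example compiles on its own. The `used_connections` method is a hypothetical helper that is not part of the PR, and using `saturating_sub` is a defensive choice made here, not something the PR does.

```rust
/// Copy of the `PoolState` struct from the diff, redeclared so this sketch
/// is self-contained.
#[derive(Debug, Copy, Clone)]
pub struct PoolState {
    pub connections: u32,
    pub idle_connections: u32,
}

impl PoolState {
    /// Hypothetical helper: connections currently checked out of the pool.
    /// `saturating_sub` returns 0 instead of panicking or wrapping if the two
    /// counters are ever observed in an inconsistent state.
    pub fn used_connections(&self) -> u32 {
        self.connections.saturating_sub(self.idle_connections)
    }
}
```

For the test pool, which now reports zero for both fields instead of panicking, the helper yields zero used connections.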

src/lib.rs

Lines changed: 1 addition & 0 deletions
@@ -40,6 +40,7 @@ mod downloads_counter;
 pub mod email;
 pub mod git;
 pub mod github;
+mod metrics;
 pub mod middleware;
 mod publish_rate_limit;
 pub mod render;

src/metrics/instance.rs

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+//! This module defines all the instance-level metrics of crates.io.
+//!
+//! Instance-level metrics are collected separately for each instance of the crates.io application,
+//! and are then aggregated at the Prometheus level. They're not suited for service-level metrics
+//! (like "how many users are there").
+//!
+//! There are two ways to update instance-level metrics:
+//!
+//! * Continuously as things happen in the instance: every time something worth recording happens
+//! the application updates the value of the metrics, accessing the metric through
+//! `req.app().instance_metrics.$metric_name`.
+//!
+//! * When metrics are scraped by Prometheus: every `N` seconds Prometheus sends a request to the
+//! instance asking what the value of the metrics are, and you can update metrics when that
+//! happens by calculating them in the `gather` method.
+//!
+//! As a rule of thumb, if the metric requires a database query to be updated it's probably a
+//! service-level metric, and you should add it to `src/metrics/service.rs` instead.
+
+use crate::util::errors::AppResult;
+use crate::{app::App, db::DieselPool};
+use prometheus::{proto::MetricFamily, IntCounter, IntGauge, IntGaugeVec};
+
+metrics! {
+    pub struct InstanceMetrics {
+        /// Number of idle database connections in the pool
+        database_idle_conns: IntGaugeVec["pool"],
+        /// Number of used database connections in the pool
+        database_used_conns: IntGaugeVec["pool"],
+
+        /// Number of requests processed by this instance
+        pub requests_total: IntCounter,
+        /// Number of requests currently being processed
+        pub requests_in_flight: IntGauge,
+    }
+
+    // All instance metrics will be prefixed with this namespace.
+    namespace: "cratesio_instance",
+}
+
+impl InstanceMetrics {
+    pub(crate) fn gather(&self, app: &App) -> AppResult<Vec<MetricFamily>> {
+        // Database pool stats
+        self.refresh_pool_stats("primary", &app.primary_database)?;
+        if let Some(follower) = &app.read_only_replica_database {
+            self.refresh_pool_stats("follower", follower)?;
+        }
+
+        Ok(self.registry.gather())
+    }
+
+    fn refresh_pool_stats(&self, name: &str, pool: &DieselPool) -> AppResult<()> {
+        let state = pool.state();
+
+        self.database_idle_conns
+            .get_metric_with_label_values(&[name])?
+            .set(state.idle_connections as i64);
+        self.database_used_conns
+            .get_metric_with_label_values(&[name])?
+            .set((state.connections - state.idle_connections) as i64);
+
+        Ok(())
+    }
+}
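
The module documentation describes metrics that are updated continuously as requests are handled. The code that updates `requests_total` and `requests_in_flight` is not shown in this excerpt, so the following is only a sketch of one way such counters could be maintained. It uses standard-library atomics in place of the `prometheus` types, and `RequestStats` and `InFlightGuard` are hypothetical names.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};

/// Hypothetical stand-in for the two request metrics, backed by atomics
/// so it can be shared between request-handling threads.
#[derive(Debug, Default)]
pub struct RequestStats {
    total: AtomicU64,
    in_flight: AtomicI64,
}

/// Guard returned when a request starts. Dropping it marks the request as
/// finished, so early returns and panics still decrement the gauge.
pub struct InFlightGuard<'a> {
    stats: &'a RequestStats,
}

impl RequestStats {
    /// Records the start of a request: the counter only ever goes up, while
    /// the in-flight gauge goes up now and back down when the guard drops.
    pub fn start_request(&self) -> InFlightGuard<'_> {
        self.total.fetch_add(1, Ordering::Relaxed);
        self.in_flight.fetch_add(1, Ordering::Relaxed);
        InFlightGuard { stats: self }
    }

    pub fn total(&self) -> u64 {
        self.total.load(Ordering::Relaxed)
    }

    pub fn in_flight(&self) -> i64 {
        self.in_flight.load(Ordering::Relaxed)
    }
}

impl Drop for InFlightGuard<'_> {
    fn drop(&mut self) {
        self.stats.in_flight.fetch_sub(1, Ordering::Relaxed);
    }
}
```

The same split appears in the example endpoint output: `requests_total` is exposed as a counter and `requests_in_flight` as a gauge.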

src/metrics/macros.rs

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
+pub(super) trait MetricFromOpts: Sized {
+    fn from_opts(opts: prometheus::Opts) -> Result<Self, prometheus::Error>;
+}
+
+#[macro_export]
+macro_rules! metrics {
+    (
+        $vis:vis struct $name:ident {
+            $(
+                #[doc = $help:expr]
+                $(#[$meta:meta])*
+                $metric_vis:vis $metric:ident: $ty:ty $([$($label:expr),* $(,)?])?
+            ),* $(,)?
+        }
+        namespace: $namespace:expr,
+    ) => {
+        $vis struct $name {
+            registry: prometheus::Registry,
+            $(
+                $(#[$meta])*
+                $metric_vis $metric: $ty,
+            )*
+        }
+        impl $name {
+            $vis fn new() -> Result<Self, prometheus::Error> {
+                use crate::metrics::macros::MetricFromOpts;
+
+                let registry = prometheus::Registry::new();
+                $(
+                    $(#[$meta])*
+                    let $metric = <$ty>::from_opts(
+                        prometheus::Opts::new(stringify!($metric), $help)
+                            .namespace($namespace)
+                            $(.variable_labels(vec![$($label.into()),*]))?
+                    )?;
+                    $(#[$meta])*
+                    registry.register(Box::new($metric.clone()))?;
+                )*
+                Ok(Self {
+                    registry,
+                    $(
+                        $(#[$meta])*
+                        $metric,
+                    )*
+                })
+            }
+        }
+        impl std::fmt::Debug for $name {
+            fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+                write!(f, "{}", stringify!($name))
+            }
+        }
+    };
+}
+
+#[macro_export]
+macro_rules! load_metric_type {
+    ($name:ident as single) => {
+        use prometheus::$name;
+        impl crate::metrics::macros::MetricFromOpts for $name {
+            fn from_opts(opts: prometheus::Opts) -> Result<Self, prometheus::Error> {
+                $name::with_opts(opts)
+            }
+        }
+    };
+    ($name:ident as vec) => {
+        use prometheus::$name;
+        impl crate::metrics::macros::MetricFromOpts for $name {
+            fn from_opts(opts: prometheus::Opts) -> Result<Self, prometheus::Error> {
+                $name::new(
+                    opts.clone().into(),
+                    opts.variable_labels
+                        .iter()
+                        .map(|s| s.as_str())
+                        .collect::<Vec<_>>()
+                        .as_slice(),
+                )
+            }
+        }
+    };
+}
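
The `metrics!` macro relies on the fact that a `///` doc comment reaches a `macro_rules!` matcher as a `#[doc = "..."]` attribute, which is how each field's doc comment becomes the metric's help text. The following is a reduced sketch of that technique with no dependency on the `prometheus` crate. `describe_metrics!` and `DemoMetrics` are hypothetical names, and the sketch only builds name and help pairs instead of registering real metrics.

```rust
/// Hypothetical, simplified cousin of `metrics!`: it captures each field's
/// doc comment as `$help` and combines the namespace with the field name.
macro_rules! describe_metrics {
    (
        $vis:vis struct $name:ident {
            $(
                #[doc = $help:expr]
                $metric:ident
            ),* $(,)?
        }
        namespace: $namespace:expr,
    ) => {
        $vis struct $name;

        impl $name {
            /// Returns `(full metric name, help text)` for every declared metric.
            $vis fn descriptions() -> Vec<(String, &'static str)> {
                vec![
                    $(
                        (
                            format!("{}_{}", $namespace, stringify!($metric)),
                            // `/// text` arrives as " text", so trim the leading space.
                            $help.trim(),
                        ),
                    )*
                ]
            }
        }
    };
}

describe_metrics! {
    pub struct DemoMetrics {
        /// Number of crates ever published
        crates_total,
        /// Number of versions ever published
        versions_total,
    }
    namespace: "cratesio_service",
}
```

Calling `DemoMetrics::descriptions()` yields one entry per field, with names such as `cratesio_service_crates_total` that follow the same namespace-plus-field pattern as the endpoint output in the PR description.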

src/metrics/mod.rs

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+pub use self::instance::InstanceMetrics;
+pub use self::service::ServiceMetrics;
+
+#[macro_use]
+mod macros;
+
+mod instance;
+mod service;
+
+load_metric_type!(IntGauge as single);
+load_metric_type!(IntCounter as single);
+load_metric_type!(IntGaugeVec as vec);

src/metrics/service.rs

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+//! This module defines all the service-level metrics of crates.io.
+//!
+//! Service-level metrics are collected for the whole service, without querying the individual
+//! instances of the application. They're not suited for instance-level metrics (like "how many
+//! requests were processed" or "how many connections are left in the database pool").
+//!
+//! Service-level metrics should **never** be updated around the codebase: instead all the updates
+//! should happen inside the `gather` method. A database connection is available inside the method.
+//!
+//! As a rule of thumb, if the metric is not straight up fetched from the database it's probably an
+//! instance-level metric, and you should add it to `src/metrics/instance.rs`.
+
+use crate::schema::{crates, versions};
+use crate::util::errors::AppResult;
+use diesel::{dsl::count_star, prelude::*, PgConnection};
+use prometheus::{proto::MetricFamily, IntGauge};
+
+metrics! {
+    pub struct ServiceMetrics {
+        /// Number of crates ever published
+        crates_total: IntGauge,
+        /// Number of versions ever published
+        versions_total: IntGauge,
+    }
+
+    // All service metrics will be prefixed with this namespace.
+    namespace: "cratesio_service",
+}
+
+impl ServiceMetrics {
+    pub(crate) fn gather(&self, conn: &PgConnection) -> AppResult<Vec<MetricFamily>> {
+        self.crates_total
+            .set(crates::table.select(count_star()).first(conn)?);
+        self.versions_total
+            .set(versions::table.select(count_star()).first(conn)?);
+
+        Ok(self.registry.gather())
+    }
+}
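
The metric families returned by `gather` are encoded by the controller with `prometheus::TextEncoder`. To show what that text looks like for a single gauge, here is a hand-written approximation using only the standard library. `render_gauge` is a hypothetical function and is not how the `prometheus` crate implements its encoder. It only covers an unlabeled integer gauge.

```rust
/// Hypothetical helper: renders one unlabeled integer gauge in the Prometheus
/// text exposition format (a HELP line, a TYPE line, then the sample).
pub fn render_gauge(name: &str, help: &str, value: i64) -> String {
    format!(
        "# HELP {0} {1}\n# TYPE {0} gauge\n{0} {2}\n",
        name, help, value
    )
}
```

For example, `render_gauge("cratesio_service_crates_total", "Number of crates ever published", 2)` produces the same three lines as the first metric in the service-level output quoted in the PR description.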
