Skip to content

feat!: metrics refactor #3262

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 40 commits into from
May 5, 2025
Merged

feat!: metrics refactor #3262

merged 40 commits into from
May 5, 2025

Conversation

Frando
Copy link
Member

@Frando Frando commented Apr 8, 2025

Description

Depends on n0-computer/net-tools#20

Until now, we were using a superglobal static iroh_metrics::core::Core struct to collect metrics into. This allowed us to use macros to track metrics from anywhere in the codebase. However, this also made it impossible to collect metrics per endpoint, which is what you want usually as soon as you have more than one endpoint in your app.

This PR builds on n0-computer/iroh-metrics#15, n0-computer/iroh-metrics#22, and n0-computer/iroh-metrics#23. It removes the global metrics collection from all crates in the iroh repository. Instead, we now create and pass metrics collector structs to all places where we need to collect metrics.

This PR disables the static_core feature from iroh-metrics, which means the macros for superglobal metrics collection are not available anymore. This is good, because otherwise we could easily miss metrics not tracked onto the proper metrics collector.

This PR also updates iroh-dns-server and iroh-relay to use manual metrics collection.

While this means that we have to pass our metrics structs to more places, it also makes metrics collection more visible, and we can now also split the metrics structs further easily if we want to separate concerns more.

This PR should not change anything apart from metrics collection. Most places are straightforward conversions from the macros to methods on the metrics collectors. At a few places, logic was changed slightly to move metrics collection a layer up to save a few clones.

Breaking Changes

  • All metrics structs (iroh::metrics::{MagicsockMetrics, PortmapMetrics, NetReportMetrics}) now implement MetricsGroup from the new version 0.34 of iroh-metrics and no longer implement traits from [email protected].
  • Metrics are no longer registered onto the static superglobal Core. iroh does not use static_core feature of iroh-metrics. Metrics are now exposed from the subsystems that track them, see e.g. Endpoint::metrics.

Several methods now take a Metrics argument. You can always pass Default::default if you don't want to unify metrics tracking with other sections.

iroh

  • iroh::metrics::{MagicsockMetrics, NetReportMetrics, PortmapMetrics} all are now marked non_exhaustive, and implement iroh_metrics::MetricsGroup from [email protected] and no longer implement iroh_metrics::Metric from [email protected]. They also no longer implement Clone (put them into an Arc for cloning instead).

  • iroh::net_report::Client::new now takes iroh::net_report::metrics::Metrics as forth argument

iroh-dns-server

  • iroh_dns_server::server::Server::spawn now takes Metrics as third argument
  • iroh_dns_server::ZoneStore::persistent now takes Metrics as third argument
  • iroh_dns_server::ZoneStore::in_memory now takes Metrics as third argument
  • iroh_dns_server::ZoneStore::new now takes Metrics as third argument
  • iroh_dns_server::state::AppState now has a public metrics: Metrics field
  • iroh_dns_server::dns::DnsHandler::new now takes Metrics as third argument
  • function iroh_dns_server::metrics::init_metrics is removed

iroh-relay

  • iroh_relay::metrics::{StunMetrics, Metrics} all are now marked non_exhaustive, and implement iroh_metrics::MetricsGroup from [email protected] and no longer implement iroh_metrics::Metric from [email protected]. They also no longer implement Clone (put them into an Arc for cloning instead).

Notes & open questions

Change checklist

Copy link

github-actions bot commented Apr 8, 2025

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3262/docs/iroh/

Last updated: 2025-05-05T17:49:33Z

Copy link

github-actions bot commented Apr 8, 2025

Netsim report & logs for this PR have been generated and is available at: LOGS
This report will remain available for 3 days.

Last updated for commit: 1b83124

@@ -99,10 +105,10 @@ impl Stream for MappedStream {
if !self.was_direct_before && matches!(new_conn_type, ConnectionType::Direct(_))
{
self.was_direct_before = true;
inc!(MagicsockMetrics, connection_became_direct);
became_direct = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum, I don't think you want to change this logic. You can totally become direct more than once for the same connection.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is best to keep the refactor faithull to the current behaviour, including all the bugs. and do the bug fixes in another PR(s). That will make it much easier to review.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this changes the logic.
Nor do I think the logic had bugs. It executes the body of this if only once over the course of was_direct_before's lifetime.

Copy link
Member Author

@Frando Frando Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention was not to fix any bugs or change existing behavior, only to not clone the the metrics once more into each connection type stream, but instead handle the metrics update in the top level actor. So if I changed anything of substance, it was my bad and I will revert.

In my reading: Before - increment connection_became_direct if the if-statement in line 99/105 triggers. After: pass became_direct: true to the outer actor if the if-statement in line 99/105 triggers, and then increment connection_became_direct in the outer actor.

@@ -869,7 +876,7 @@ mod tests {
let tls_config = make_tls_config();

// start server
let mut server = ServerBuilder::new("127.0.0.1:0".parse().unwrap())
let mut server = ServerBuilder::new("127.0.0.1:0".parse().unwrap(), Default::default())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be nicer to make the metrics be a call

eg

// default
ServerBuilder::new("127.0.0.1:0".parse().unwrap())
  .spawn();
// explicit
ServerBuilder::new("127.0.0.1:0".parse().unwrap())
  .metrics(metrics)
  .spawn();

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would mean making it optional though, no? Currently metrics are required, which is why I added this on the constructor.
I can make them optional, it'll be if let Some(metrics) = self.metrics { metrics.foo.inc(); } then at each place where metrics are used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't mean to make them optional, just construct them as default if none are passed in

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed that to use the builder method instead of Default in the constructor.

///
/// See [`crate::endpoint::Endpoint::metrics`] for details.
#[derive(Default, Debug, Clone)]
pub struct EndpointMetrics {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably should mark all the metrics structs as non_exhaustive

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -285,7 +288,9 @@ impl ActiveRelayActor {
///
/// Primarily switches between the dialing and connected states.
async fn run(mut self) -> Result<()> {
inc!(MagicsockMetrics, num_relay_conns_added);
// TODO(frando): decide what this metric means, it's either wrong here or in node_state.rs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make an issue for this, otherwise I fear this will get lost very quickly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// we potentially have a relay connection to the node
inc!(MagicsockMetrics, num_relay_conns_added);
}
// TODO(frando): I don't think we need to track the `num_relay_conns_added`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Frando Frando changed the title feat: metrics refactor feat!: metrics refactor Apr 16, 2025
@dignifiedquire dignifiedquire added this to the 0.35.0 milestone May 1, 2025
@Frando Frando requested a review from dignifiedquire May 2, 2025 12:19
@Frando Frando requested a review from flub May 5, 2025 07:48
dignifiedquire pushed a commit to n0-computer/net-tools that referenced this pull request May 5, 2025
## Description

Needed for n0-computer/iroh#3262

Updates `iroh-metrics` to 0.34, and makes the metrics collection
non-global. Metrics are now collected per portmapper service.

## Breaking Changes

* `portmapper::metrics::Metrics` now implements `MetricsGroup` from
`[email protected]`, and no longer implements `Metric` from
`[email protected]`.

## Notes & open questions
<!-- Any notes, remarks or open questions you have to make about the PR.
-->
## Change checklist
- [x] Self-review.
- [x] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.
- [x] Tests if relevant.
- [x] All breaking changes documented.
@dignifiedquire dignifiedquire added this pull request to the merge queue May 5, 2025
Merged via the queue into main with commit 1957ca8 May 5, 2025
25 of 26 checks passed
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in iroh May 5, 2025
Frando added a commit to n0-computer/iroh-gossip that referenced this pull request May 7, 2025
## Description

Adapts iroh-metrics for n0-computer/iroh-metrics#15 and n0-computer/iroh#3262

Depends on n0-computer/iroh#3262

## Breaking Changes


* `metrics::Metrics` now implements `MetricsGroup` from the  ì[email protected]`
* Metrics are no longer tracked into the `static_core` from `iroh-metrics`, but instead are tracked per `Gossip` and exposed via `Gossip::metrics`
* `proto::state::State::handle` now takes `Option<&Metrics>` as new 4th parameter

## Notes & open questions

<!-- Any notes, remarks or open questions you have to make about the PR. -->

## Change checklist

- [x] Self-review.
- [x] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant.
- [x] Tests if relevant.
- [x] All breaking changes documented.
Frando added a commit to n0-computer/iroh-blobs that referenced this pull request May 8, 2025
…ics collection (#85)

## Description

Updates metrics tracking to the new non-global tracking.

Metrics are tracked per downloader. We already only track metrics within
the downloader, so this was rather straightforward.

The tracking of request stats was changed to be tracked once the
download task completes, as this seemed easier to me. Needs a careful
review, but I think the change is sound (i.e. no change).

Depends on n0-computer/iroh#3262

## Breaking Changes

* `metrics::Metrics` now implements `MetricsGroup` from the recent
ìroh-metrics` release (TODO: fill in version after release)
* Metrics are no longer tracked into the `static_core` from
`iroh-metrics`, but instead are tracked per `Downloder` and exposed via
`Downloader::metrics`

## Notes & open questions

<!-- Any notes, remarks or open questions you have to make about the PR.
-->

## Change checklist

- [x] Self-review.
- [x] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.
- [x] Tests if relevant.
- [x] All breaking changes documented.

---------

Co-authored-by: Ruediger Klaehn <[email protected]>
ramfox pushed a commit to n0-computer/iroh-docs that referenced this pull request May 12, 2025
## Description

Updates metrics tracking to the new non-global tracking.

Metrics are tracked per sync actor.

We previously tracked a few metrics within the replica, this is removed
and instead tracked from the sync actor now.

Depends on n0-computer/iroh#3262
Depends on n0-computer/iroh-blobs#85
Depends on n0-computer/iroh-gossip#58

## Breaking Changes

* `metrics::Metrics` now implements `MetricsGroup` from the recent
ìroh-metrics` release (TODO: fill in version after release)
* Metrics are no longer tracked into the `static_core` from
`iroh-metrics`, but instead are tracked per `Engine` and exposed via
`Engine::metrics`

## Notes & open questions

<!-- Any notes, remarks or open questions you have to make about the PR.
-->

## Change checklist

- [x] Self-review.
- [x] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.
- [x] Tests if relevant.
- [x] All breaking changes documented.
@dignifiedquire dignifiedquire deleted the Frando/metrics branch May 14, 2025 08:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

4 participants