Skip to content

feat: span processor api refactor #2962

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 28 commits into
base: main
Choose a base branch
from

Conversation

paullegranddc
Copy link
Contributor

@paullegranddc paullegranddc commented May 13, 2025

Design discussion issue (if applicable)

#2940

Changes

This PR refactors the SpanProcessor API.

The API currently has the following issues:

  1. It's not possible to read a span fields in a processor without calling export_data, which completely copies the data
  2. The SpanProcessor::on_end API means owned span data has to be passed to each span processor, even if it does nothing with it.

One way that was explored to fix number 2. was to pass a &mut Span to on_end do processor can use std::mem::take to consume data without copying, but this is not ideal for at least two reasons:

  • The otel sdk spec specifically forbids mutation of the span in on_end, and modifications done in one processor should not impact data passed to other processors
  • If there are multiple processors, and any processor but the last one grabs the data and leaves the span empty, this will cause bugs dependent on span processor ordering...

This PR thus proposes the following changes changes to the API:

Introduce a ReadableSpan trait, implemented for SdkSpan.

This trait define getters for span fields so we can read the span without copying or consuming data. This is inspired by other Otel libraries such as Java which has a similar interface.

Modify on_end to pass it a FinishedSpan.

FinishedSpan is a new abstraction around a span that has been closed. It implements the ReadableSpan trait so we can read from it without consuming the data.
if a span processor needs to get an owned SpanData on it. This will do one of two things:

  • if this is the last span processor that will be invoked, it gives ownership of the span data without copying
  • otherwise it copies the span data

This design

  1. allows not copying in span processors that don't need owned data during on_end
  2. doesn't expose mutable spans in on_end
  3. eludes the span data copy if the exporter is the last span processor (which is usually the case).

Benchmarking

A benchmark running creating a span and dropping it, invoking the on_start, on_ending and on_end methods for a varying number of span processors has been added.

Baseline - main branch:

SpanProcessorApi/0_processors
    time:   [339.66 ns 340.56 ns 341.47 ns]
SpanProcessorApi/1_processors
    time:   [373.10 ns 374.36 ns 375.60 ns]
SpanProcessorApi/2_processors
    time:   [803.10 ns 804.99 ns 807.03 ns]
SpanProcessorApi/4_processors
    time:   [1.2096 µs 1.2137 µs 1.2179 µs]

Candidate - paullegranddc:paullgdc/sdk/span_processor_api_refactor:

SpanProcessorApi/0_processors
    time:   [385.15 ns 386.14 ns 387.25 ns]
SpanProcessorApi/1_processors
    time:   [385.73 ns 387.17 ns 388.85 ns]
SpanProcessorApi/2_processors
    time:   [384.84 ns 385.66 ns 386.50 ns]
SpanProcessorApi/4_processors
    time:   [386.78 ns 388.17 ns 389.58 ns]

On main, the cost of a span lifecycle is proportional to the number of span processors, because the Clone operation on span data is as expensive as creating the span.

After this PR, since span processors don't clone data unless they need it, the cost is constant regardless of the number of span processors.

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

Currently there is no way to read the span data without cloning it.
This causes performance issues when the span for span
processors that need to read the span data.

This commit introduces a new trait `ReadableSpan` in the sdk
implemented by SDK Spans that allows to read the span data without
cloning it.
This API allows to mutations of the span when it is ending.
It's marked as on development in the spec, but it is useful for
span obfuscation for example, which needs to done after
attributes can added to the span anymore.
@paullegranddc paullegranddc requested a review from a team as a code owner May 13, 2025 16:00
Copy link

codecov bot commented May 13, 2025

Codecov Report

Attention: Patch coverage is 84.36578% with 53 lines in your changes missing coverage. Please review.

Project coverage is 81.1%. Comparing base (8882c31) to head (343aba2).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
opentelemetry-sdk/src/trace/span.rs 83.0% 53 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##            main   #2962    +/-   ##
======================================
  Coverage   81.0%   81.1%            
======================================
  Files        126     126            
  Lines      24753   25039   +286     
======================================
+ Hits       20070   20323   +253     
- Misses      4683    4716    +33     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@scottgerring scottgerring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @paullegranddc thanks for opening the PR!
I've had a quick zoom through now and have some comments and queries.

This feels like a sensible way of addressing the problems you mentioned (and highlighted in the examples 👍 ). Of course it will break the existing on_end, but as we're not stable here, yet and there's a clear upshot, from my perspective that seems reasonable. I think @cijothomas will also have helpful opinions here - Cijo?

Adding on_ending seems like a no brainer.

It would be great to look at some big downstream projects (tracing-opentelemetry comes to mind), and check if they are impacted. It would also be great to have some perf numbers from benchmarks touching the path through ensure_ended_and_exported which you may already have handy?

/// `on_end` is called after a `Span` is ended (i.e., the end timestamp is
/// already set). This method is called synchronously within the `Span::end`
/// API, therefore it should not block or throw an exception.
/// TODO - This method should take reference to `SpanData`
fn on_end(&self, span: SpanData);
fn on_end(&self, span: &mut FinishedSpan);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting this is a breaking change

@@ -4,10 +4,10 @@ use opentelemetry::trace::{
SpanContext, SpanId, SpanKind, Status, TraceFlags, TraceId, TraceState,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you run the benchmark suite to do a performance regression against main ? Would be great to include.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've run the existing BatchSpanProcessor benchmarks.
With experimental_span_processor_on_ending not enabled, the benchmark show no detectable changes in performance.

If the experimental_span_processor_on_ending feature is enabled, there is a slight cost (5%) which I believe comes from the fact we have to clone the tracer field, because we have to iterate over the span processor list through a shared reference, while passing the span mutably. Thus we have to clone the tracer associated with the span so we can split the ownership (it's in an Arc cell so this is not that expensive)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we missing a benchmark? It seems like some of your changes should improve performance -->

https://github.com/open-telemetry/opentelemetry-rust/pull/2962/files/1d3e6d89637daebad09a9f792ae095f63be48e1a#r2097787712

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, some benchmarks running multiple span processors were missing, I added them un this PR. The comparison between the main branch and this branch for the benchmarks is in this PR description.

Baseline - main branch:

SpanProcessorApi/0_processors
    time:   [339.66 ns 340.56 ns 341.47 ns]
SpanProcessorApi/1_processors
    time:   [373.10 ns 374.36 ns 375.60 ns]
SpanProcessorApi/2_processors
    time:   [803.10 ns 804.99 ns 807.03 ns]
SpanProcessorApi/4_processors
    time:   [1.2096 µs 1.2137 µs 1.2179 µs]

Candidate - paullegranddc:paullgdc/sdk/span_processor_api_refactor:

SpanProcessorApi/0_processors
    time:   [385.15 ns 386.14 ns 387.25 ns]
SpanProcessorApi/1_processors
    time:   [385.73 ns 387.17 ns 388.85 ns]
SpanProcessorApi/2_processors
    time:   [384.84 ns 385.66 ns 386.50 ns]
SpanProcessorApi/4_processors
    time:   [386.78 ns 388.17 ns 389.58 ns]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This is exactly what I was hoping to see :D

Adds a benchmark creating a span and dropping it, without an exporter.
The goal is to estimate the cost of the SDK running without the exporter
Copy link
Contributor

@scottgerring scottgerring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM modulo the current builds passing. @cijothomas wdyt?


impl FinishedSpan {
/// Creates a new `FinishedSpan` with the given span data.
pub fn new(span_data: crate::trace::SpanData) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we implement a From between those two?

opentelemetry::otel_error!(name: "FinishedSpan.ConsumeTwice", message = "consume called twice on FinishedSpan in the same span processor");
}
self.try_consume()
.expect("Span data has already been consumed")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just try_consume and throw error if it's already consumed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be calling expect at all? This feels right on the line with "let's not panic in the library code", as it highlights an issue with a span processor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My philosophy on this is that in the end, if this panics it's because of an API misuse. Not because of wrong assumptions on the data manipulated by the program which I understand coding defensively against.
As someone implementing a span processor, I'd rather crash immediately if I misuse the API which is surely going to be caught in my tests.

Having a try_consume method gives you the freedom to handle the misuse if you want, but in practice as someone writing code against this API I'd rather not handling this mode of failure makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants