-
Notifications
You must be signed in to change notification settings - Fork 525
feat: span processor api refactor #2962
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: span processor api refactor #2962
Conversation
Currently there is no way to read the span data without cloning it. This causes performance issues when the span for span processors that need to read the span data. This commit introduces a new trait `ReadableSpan` in the sdk implemented by SDK Spans that allows to read the span data without cloning it.
This API allows to mutations of the span when it is ending. It's marked as on development in the spec, but it is useful for span obfuscation for example, which needs to done after attributes can added to the span anymore.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2962 +/- ##
======================================
Coverage 81.0% 81.1%
======================================
Files 126 126
Lines 24753 25039 +286
======================================
+ Hits 20070 20323 +253
- Misses 4683 4716 +33 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @paullegranddc thanks for opening the PR!
I've had a quick zoom through now and have some comments and queries.
This feels like a sensible way of addressing the problems you mentioned (and highlighted in the examples 👍 ). Of course it will break the existing on_end
, but as we're not stable here, yet and there's a clear upshot, from my perspective that seems reasonable. I think @cijothomas will also have helpful opinions here - Cijo?
Adding on_ending
seems like a no brainer.
It would be great to look at some big downstream projects (tracing-opentelemetry
comes to mind), and check if they are impacted. It would also be great to have some perf numbers from benchmarks touching the path through ensure_ended_and_exported
which you may already have handy?
/// `on_end` is called after a `Span` is ended (i.e., the end timestamp is | ||
/// already set). This method is called synchronously within the `Span::end` | ||
/// API, therefore it should not block or throw an exception. | ||
/// TODO - This method should take reference to `SpanData` | ||
fn on_end(&self, span: SpanData); | ||
fn on_end(&self, span: &mut FinishedSpan); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noting this is a breaking change
@@ -4,10 +4,10 @@ use opentelemetry::trace::{ | |||
SpanContext, SpanId, SpanKind, Status, TraceFlags, TraceId, TraceState, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you run the benchmark suite to do a performance regression against main
? Would be great to include.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've run the existing BatchSpanProcessor
benchmarks.
With experimental_span_processor_on_ending
not enabled, the benchmark show no detectable changes in performance.
If the experimental_span_processor_on_ending
feature is enabled, there is a slight cost (5%) which I believe comes from the fact we have to clone the tracer field, because we have to iterate over the span processor list through a shared reference, while passing the span mutably. Thus we have to clone the tracer associated with the span so we can split the ownership (it's in an Arc
cell so this is not that expensive)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we missing a benchmark? It seems like some of your changes should improve performance -->
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, some benchmarks running multiple span processors were missing, I added them un this PR. The comparison between the main branch and this branch for the benchmarks is in this PR description.
Baseline - main branch:
SpanProcessorApi/0_processors
time: [339.66 ns 340.56 ns 341.47 ns]
SpanProcessorApi/1_processors
time: [373.10 ns 374.36 ns 375.60 ns]
SpanProcessorApi/2_processors
time: [803.10 ns 804.99 ns 807.03 ns]
SpanProcessorApi/4_processors
time: [1.2096 µs 1.2137 µs 1.2179 µs]
Candidate - paullegranddc:paullgdc/sdk/span_processor_api_refactor:
SpanProcessorApi/0_processors
time: [385.15 ns 386.14 ns 387.25 ns]
SpanProcessorApi/1_processors
time: [385.73 ns 387.17 ns 388.85 ns]
SpanProcessorApi/2_processors
time: [384.84 ns 385.66 ns 386.50 ns]
SpanProcessorApi/4_processors
time: [386.78 ns 388.17 ns 389.58 ns]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! This is exactly what I was hoping to see :D
Adds a benchmark creating a span and dropping it, without an exporter. The goal is to estimate the cost of the SDK running without the exporter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM modulo the current builds passing. @cijothomas wdyt?
|
||
impl FinishedSpan { | ||
/// Creates a new `FinishedSpan` with the given span data. | ||
pub fn new(span_data: crate::trace::SpanData) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we implement a From
between those two?
opentelemetry::otel_error!(name: "FinishedSpan.ConsumeTwice", message = "consume called twice on FinishedSpan in the same span processor"); | ||
} | ||
self.try_consume() | ||
.expect("Span data has already been consumed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we just try_consume
and throw error if it's already consumed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be calling expect at all? This feels right on the line with "let's not panic in the library code", as it highlights an issue with a span processor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My philosophy on this is that in the end, if this panics it's because of an API misuse. Not because of wrong assumptions on the data manipulated by the program which I understand coding defensively against.
As someone implementing a span processor, I'd rather crash immediately if I misuse the API which is surely going to be caught in my tests.
Having a try_consume
method gives you the freedom to handle the misuse if you want, but in practice as someone writing code against this API I'd rather not handling this mode of failure makes sense.
Design discussion issue (if applicable)
#2940
Changes
This PR refactors the SpanProcessor API.
The API currently has the following issues:
export_data
, which completely copies the dataSpanProcessor::on_end
API means owned span data has to be passed to each span processor, even if it does nothing with it.One way that was explored to fix number 2. was to pass a
&mut Span
toon_end
do processor can use std::mem::take to consume data without copying, but this is not ideal for at least two reasons:on_end
, and modifications done in one processor should not impact data passed to other processorsThis PR thus proposes the following changes changes to the API:
Introduce a
ReadableSpan
trait, implemented for SdkSpan.This trait define getters for span fields so we can read the span without copying or consuming data. This is inspired by other Otel libraries such as Java which has a similar interface.
Modify
on_end
to pass it aFinishedSpan
.FinishedSpan
is a new abstraction around a span that has been closed. It implements theReadableSpan
trait so we can read from it without consuming the data.if a span processor needs to get an owned SpanData on it. This will do one of two things:
This design
on_end
on_end
Benchmarking
A benchmark running creating a span and dropping it, invoking the
on_start
,on_ending
andon_end
methods for a varying number of span processors has been added.On main, the cost of a span lifecycle is proportional to the number of span processors, because the
Clone
operation on span data is as expensive as creating the span.After this PR, since span processors don't clone data unless they need it, the cost is constant regardless of the number of span processors.
Merge requirement checklist
CHANGELOG.md
files updated for non-trivial, user-facing changes