Skip to content

fix(opentelemetry): Fix span & sampling propagation #11092

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 14, 2024

Conversation

mydea
Copy link
Member

@mydea mydea commented Mar 14, 2024

OK, this was a tricky one, but I think it now works as expected.

This PR fixes to fundamental issues with sampling & propagation that were uncovered by @Lms24 & myself while trying to use OTEL for remix & sveltekit:

  1. continueTrace updates the propagation context, but if there is an active parent span (even a remote one) this is ignored.
  2. Sampling inheritance did not work as expected, due to the fact that OTEL spans cannot differentiate between sampled=false (sampled to be not recorded) and sampled=undefined (no sampling decision yet).

Update to continueTrace & trace propagation

While my first instinct was to ensure that in the trace methods, if we have remote span we ignore it and look at the propagation context, this has a bunch of problems - because it means we can run out of sync, if this is set from outside, etc.

So instead, I now provide a custom continueTrace method from @sentry/opentelemetry & @sentry/node which should be used instead of the core one in meta SDKs. This method will, in addition to updating the propagation context, also create a remote span with the passed in data, and make it the active span in the callback.

Then, I updated the otel start span APIs to always use that, if it exists (which was already the behavior we had), PLUS also added behavior that if there is no active span at all (not even a remote one), then we look at the propagation context.

Update to sampling inheritance

Previously, we basically did the following:

const sampled: Boolean | undefined = spanContext.traceFlags === TraceFlags.SAMPLED; 
// this will always be true or false, never undefined

Which means that if we create a remote span from a minimal propagation context:

// This could be a generated propagation context from a scope
const propagationContext = { spanId: 'xxx', traceId: 'yyy' };

const spanContext: SpanContext = {
  sampled: propagationContext.sampled ? TraceFlags.SAMPLED : TraceFlags.NONE,
}

We would later always get sampled: false, and inherit this decision for all downstream spans - instead of treating it as undefined, and going through the sampler, as we actually want it to.

In order to "solve" this, I added a new trace state SENTRY_TRACE_STATE_SAMPLED_NOT_RECORDING, which we set if we know this is actually sampled: false, and not just unset.

Then, based on this we can interpret sampled as being false or undefined, respectively.

This is a bit hacky but should work - it means that if we get a sampling decision from outside we'll treat it as undefined, which is OK I would say. Our own sampler will set this correctly so we inherit correctly as well, and our propagator does so too.

Copy link
Member

@Lms24 Lms24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, This PR fixes my failing SvelteKit tests (mostly) - #11075

There's still one thing that's not fixed (edge case when we get a sentry-trace header but no baggage header) but I'm not yet sure what is causing it and also need to test the same scenario in a different place (the handle hook). This is definitely an improvement already so ✅

l: Can we add a note to the continueTrace function from core reminding us to not use it in Node-based SDKs?

@mydea mydea merged commit 0f1ea2f into develop Mar 14, 2024
@mydea mydea deleted the fn/otel-sampling-propagation branch March 14, 2024 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants