-
Notifications
You must be signed in to change notification settings - Fork 454
CDRIVER-4500: SDAM structured log and unified test support #1842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
from commit d795d493c41022cb8ed15006ae5ac5ad85936f40
To support SDAM tests, we need to access entities that have been closed.
…tance This replaces a few duplicated copies of the APM callbacks with a centralized copy maintained inside topology_t by mongoc_log_and_monitor_instance_t.
We had oid_as_hex, but this adds a plain bson ObjectId representation.
Co-authored-by: Ezra Chung <[email protected]>
Co-authored-by: Ezra Chung <[email protected]>
Co-authored-by: Ezra Chung <[email protected]>
Relocate deprecation notice to upcoming 1.30.0 release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the explanations and clarifications. Re-suggesting some changes to coincide with additional motivations outlined in responses below, but otherwise, LGTM.
Re the extra warnings you're reporting, is there any plan to include them in an evergreen check?
No, these are warning I enable for local development (Clang's -Weverything
+ set of -Wno-*
to disable irrelevant/undesirable warnings) to assist with drive-by code improvements and avoid expanding the set of possible warnings which might be observed by users (comparing a diff of total warnings for proposed changes against the base commit).
casting away const in an explicit conversion to (void*) seems to be a common enough pattern in the repository.
It is an unfortunately common pattern. The reason for the -Wcast-qual
warning (which is emitted in spite of the explicit cast to void*
) is to guard against the possibility of undefined behavior due to accessing a const-qualified object when the non-const void pointer is (expected to be) cast back to an object pointer. Recent code changes has been making efforts to improve the const-correctness of our interfaces even when type-erasure is involved (*_const
functions).
Why do we need to disallow stateful filter functions?
The suggestion was based simply on the observation that no current usage of this pointer requires modifiable access to the pointed-to object: it is only being used for pointer equality comparison here and string comparison here. If your preference is to leave open the possibility for modifiable object pointers by using void*
, that is fine.
My preference was to avoid the "unsafe" version unless there was a reason to prefer it.
Suggestion was not for performance reasons, but instead to communciate that synchronization is not required in this context (there should only be one thread accessing the to-be-destroyed topology object). That is, TSAN should rightfully warn if there are multiple threads accessing (un)guarded resources at this point.
I declined to document apm_mutex in additional detail here because I'm not convinced the current implementation is correct. [...] For this patch, I tried to preserve the existing behavior without changing it.
The proposed rationale was echoing the one given at the time the mutex was introduced (see: #607 (comment)).
My guess is that, at the time, thread-safe usage of the C Driver API may not have been an actively supported use case, therefore APM callback invocations were assumed to be always single-threaded (from the perspective of a user who registered callbacks) until the introduction of background server monitoring threads. This would explain why no other APM callback invocations are guarded.
This is for skipping events of other event types, when the "eventType" parameter is being used. When there's no "eventType", this never skips events.
I overlooked the specified behavior of eventType
event checks. Thank you.
The skipping of cmap events would still be required.
It's been documented in the manpage.
Sorry, I was referencing outdated docs (generated locally). I see the Thread Safety section on the live website.
It's a workaround for allowing applications to see events which would otherwise be emitted earlier than the APM callbacks could be attached.
Thank you for the explanation.
It's not clear to me that an ASSERT here would help. [...] I don't think it would be an improvement to guarantee a crash after opening 2^64 topologies per process.
The proposed assertion is not to prevent collisions, but to avoid undefined behavior due to signed integer overflow during the atomic increment. The assertion would guard against undefined behavior by detecting the soon-to-occur condition and triggering well-defined erroneous behavior instead.
This implies that we could switch to using type-specific event lists, instead of having one event list that we filter.
I am fine with deferring this potential improvement to a separate PR.
/* Before reporting this topology as closed, life cycle rules expect us to close | ||
* all servers and transition to an unknown topology. */ | ||
{ | ||
mc_shared_tpld td = mc_tpld_take_ref (topology); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider suggestion once more, not for performance, but to communicate thread safety expectations within this context (synchronization "should not" be necessary here; allow TSAN to detect violation).
static int64_t serial_number_atomic = 0; | ||
mongoc_log_and_monitor_serial_t result = | ||
1 + mcommon_atomic_int64_fetch_add (&serial_number_atomic, 1, mcommon_memory_order_seq_cst); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider suggestion once more, not to prevent collisions, but to avoid undefined behavior due to signed integer overflow (during the atomic increment) by triggering well-defined erroneous behavior (assertion) instead before it can occur.
On Thu, Jan 30, 2025 at 8:24 AM Ezra Chung ***@***.***> wrote:
> My preference was to avoid the "unsafe" version unless there was a reason to prefer it.
Suggestion was not for performance reasons, but instead to communciate that synchronization is not required in this context (there should only be one thread accessing the to-be-destroyed topology object). That is, TSAN should rightfully warn if there are multiple threads accessing (un)guarded resources at this point.
Ahh, gotcha. Good idea. I'll make a note of that in the comments too.
I declined to document apm_mutex in additional detail here because I'm not convinced the current implementation is correct. [...] For this patch, I tried to preserve the existing behavior without changing it.
The proposed rationale was echoing the one given at the time the mutex was introduced (see: #607 (comment)).
My guess is that, at the time, thread-safe usage of the C Driver API may not have been an actively supported use case, therefore APM callback invocations were assumed to be always single-threaded (from the perspective of a user who registered callbacks) until the introduction of background server monitoring threads. This would explain why no other APM callback invocations are guarded.
Ok. I'll add an explanation that references the original motive plus
the potential pitfall.
> It's not clear to me that an ASSERT here would help. [...] I don't think it would be an improvement to guarantee a crash after opening 2^64 topologies per process.
The proposed assertion is not to prevent collisions, but to avoid undefined behavior due to signed integer overflow during the atomic increment. The assertion would guard against undefined behavior by detecting the soon-to-occur condition and triggering well-defined erroneous behavior instead.
I understand, I'm just not sure it's an improvement. I'll think about
a better solution. My point of view is that UB in this case is going
to be theoretical only, and we are trading a higher level of purity
with respect to the C spec for an actual runtime fault. In any case
this is far into the weeds with unlikely events. The incrementing
counter could be exchanged with any other locally unique identifier.
> This implies that we could switch to using type-specific event lists, instead of having one event list that we filter.
I am fine with deferring this potential improvement to a separate PR.
Sure. I'm not sure it's necessary to prioritize this, the unified test
runner could really use a more thorough overhaul.
Thanks,
--micah
|
I'm going to make the mongoc_log_and_monitor_serial_t a bson_oid_t so we can use the centralized copy of this technically-undefined behavior that exists in _bson_context_set_oid_seq64. |
This reverts 'opening' back to a single boolean flag, and stops trying to preserve the old behavior where unset callbacks would defer the apparent beginning of server monitoring. Feedback is that the old behavior was unintended, and it's better not to attempt to retain it.
Kevin and I discussed the changes to lazy-opening. The old behavior where server open could be deferred due to an unset callback is not necessarily intended, and not worth preserving. Without this requirement, we can avoid making lazy-opening more complicated. Now version_id is gone, and 'opened' is back to a boolean flag. |
Waiting to merge until I can get a clean CI run, which is currently blocked on 8.0.5-rc1 |
The central purpose of the patch is to add missing log events for heartbeats, topology, and server updates. Several other changes became necessary:
Refactored APM callbacks into a new "log and monitor instance"
Previously, APM callbacks were duplicated in many places including each individual topology description instance. This moves both APM and Structured Log callbacks/settings to a new "log and monitor instance" (mongoc_log_and_monitor_instance_t) object with a clear internal interface. It's owned by mongoc_topology_t, and given as an explicit parameter to log and monitor functions.
Existing documented thread safety rules about pool settings (At most one write, prior to the first client) are implemented uniformly for both structured logs and APM callbacks. Edit: For compatibility, previously allowed unsafe uses are now still allowed. A new warning message will be reported as an unstructured log error.
Refined topology lifecycle events to match standard behavior
Server 'opened' events are deferred until after the corresponding topology is 'opened', rather than until the APM callbacks have been set.
Added missing end-of-life SDAM events. Before monitoring is closed, topology descriptions will now transition to an "Unknown" state as required. Server monitoring will be closed. Closure events are only emitted if the same log_and_monitor instance has seen a corresponding open.
Added test skips for failures corresponding to features that are currently out of scope:
Added new unified test runner features:
Added new structured log items: