Skip to content

CDRIVER-4500: SDAM structured log and unified test support #1842

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
81 commits merged into from
Feb 5, 2025

Conversation

ghost
Copy link

@ghost ghost commented Jan 29, 2025

  • Resolves CDRIVER-4500 (SDAM structured logs + updated unified tests)
  • Resolves CDRIVER-4758 (Prose test for heartbeat event order: /server_discovery_and_monitoring/prose/heartbeat)
  • Resolves CDRIVER-4137 (Missing server close events, now tested by /server_discovery_and_monitoring/unified)
  • Completes outstanding issues in the CDRIVER-3775 epic (libmongoc structured logging)

The central purpose of the patch is to add missing log events for heartbeats, topology, and server updates. Several other changes became necessary:

  • Refactored APM callbacks into a new "log and monitor instance"

    • Previously, APM callbacks were duplicated in many places including each individual topology description instance. This moves both APM and Structured Log callbacks/settings to a new "log and monitor instance" (mongoc_log_and_monitor_instance_t) object with a clear internal interface. It's owned by mongoc_topology_t, and given as an explicit parameter to log and monitor functions.

    • Existing documented thread safety rules about pool settings (At most one write, prior to the first client) are implemented uniformly for both structured logs and APM callbacks. Edit: For compatibility, previously allowed unsafe uses are now still allowed. A new warning message will be reported as an unstructured log error.

  • Refined topology lifecycle events to match standard behavior

    • Server 'opened' events are deferred until after the corresponding topology is 'opened', rather than until the APM callbacks have been set.

    • Added missing end-of-life SDAM events. Before monitoring is closed, topology descriptions will now transition to an "Unknown" state as required. Server monitoring will be closed. Closure events are only emitted if the same log_and_monitor instance has seen a corresponding open.

  • Added test skips for failures corresponding to features that are currently out of scope:

    • CMAP events, especially pool clearing on various error conditions
    • Driver-generated connection IDs (optional in the spec, but some tests assume they are present)
    • Thread entities in unified tests
  • Added new unified test runner features:

    • Support for "closing" some entity types without deleting (For example, to inspect events logged when closing a client)
    • SDAM APM events, and a new eventType filtering implementation
    • New operations: recordTopologyDescription, assertTopologyType, waitForPrimaryChange
    • Added a structured log filter stack (entity_map_log_filter_push/pop) used internally to refine the apparent behavior of waitForEvent
  • Added new structured log items:

    • oid() for a plain ObjectID without hex representation
    • topology_description_as_json() for plain topology descriptions not inside a topology

Micah Scott added 30 commits January 28, 2025 14:58
from commit d795d493c41022cb8ed15006ae5ac5ad85936f40
To support SDAM tests, we need to access entities that have been closed.
…tance

This replaces a few duplicated copies of the APM callbacks with a centralized
copy maintained inside topology_t by mongoc_log_and_monitor_instance_t.
We had oid_as_hex, but this adds a plain bson ObjectId representation.
@ghost ghost requested a review from eramongodb January 30, 2025 02:27
Copy link
Contributor

@eramongodb eramongodb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanations and clarifications. Re-suggesting some changes to coincide with additional motivations outlined in responses below, but otherwise, LGTM.

Re the extra warnings you're reporting, is there any plan to include them in an evergreen check?

No, these are warning I enable for local development (Clang's -Weverything + set of -Wno-* to disable irrelevant/undesirable warnings) to assist with drive-by code improvements and avoid expanding the set of possible warnings which might be observed by users (comparing a diff of total warnings for proposed changes against the base commit).

casting away const in an explicit conversion to (void*) seems to be a common enough pattern in the repository.

It is an unfortunately common pattern. The reason for the -Wcast-qual warning (which is emitted in spite of the explicit cast to void*) is to guard against the possibility of undefined behavior due to accessing a const-qualified object when the non-const void pointer is (expected to be) cast back to an object pointer. Recent code changes has been making efforts to improve the const-correctness of our interfaces even when type-erasure is involved (*_const functions).

Why do we need to disallow stateful filter functions?

The suggestion was based simply on the observation that no current usage of this pointer requires modifiable access to the pointed-to object: it is only being used for pointer equality comparison here and string comparison here. If your preference is to leave open the possibility for modifiable object pointers by using void*, that is fine.

My preference was to avoid the "unsafe" version unless there was a reason to prefer it.

Suggestion was not for performance reasons, but instead to communciate that synchronization is not required in this context (there should only be one thread accessing the to-be-destroyed topology object). That is, TSAN should rightfully warn if there are multiple threads accessing (un)guarded resources at this point.

I declined to document apm_mutex in additional detail here because I'm not convinced the current implementation is correct. [...] For this patch, I tried to preserve the existing behavior without changing it.

The proposed rationale was echoing the one given at the time the mutex was introduced (see: #607 (comment)).

My guess is that, at the time, thread-safe usage of the C Driver API may not have been an actively supported use case, therefore APM callback invocations were assumed to be always single-threaded (from the perspective of a user who registered callbacks) until the introduction of background server monitoring threads. This would explain why no other APM callback invocations are guarded.

This is for skipping events of other event types, when the "eventType" parameter is being used. When there's no "eventType", this never skips events.

I overlooked the specified behavior of eventType event checks. Thank you.

The skipping of cmap events would still be required.

It's been documented in the manpage.

Sorry, I was referencing outdated docs (generated locally). I see the Thread Safety section on the live website.

It's a workaround for allowing applications to see events which would otherwise be emitted earlier than the APM callbacks could be attached.

Thank you for the explanation.

It's not clear to me that an ASSERT here would help. [...] I don't think it would be an improvement to guarantee a crash after opening 2^64 topologies per process.

The proposed assertion is not to prevent collisions, but to avoid undefined behavior due to signed integer overflow during the atomic increment. The assertion would guard against undefined behavior by detecting the soon-to-occur condition and triggering well-defined erroneous behavior instead.

This implies that we could switch to using type-specific event lists, instead of having one event list that we filter.

I am fine with deferring this potential improvement to a separate PR.

/* Before reporting this topology as closed, life cycle rules expect us to close
* all servers and transition to an unknown topology. */
{
mc_shared_tpld td = mc_tpld_take_ref (topology);
Copy link
Contributor

@eramongodb eramongodb Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider suggestion once more, not for performance, but to communicate thread safety expectations within this context (synchronization "should not" be necessary here; allow TSAN to detect violation).

Comment on lines 25 to 27
static int64_t serial_number_atomic = 0;
mongoc_log_and_monitor_serial_t result =
1 + mcommon_atomic_int64_fetch_add (&serial_number_atomic, 1, mcommon_memory_order_seq_cst);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider suggestion once more, not to prevent collisions, but to avoid undefined behavior due to signed integer overflow (during the atomic increment) by triggering well-defined erroneous behavior (assertion) instead before it can occur.

@ghost
Copy link
Author

ghost commented Jan 30, 2025 via email

@ghost
Copy link
Author

ghost commented Jan 30, 2025

I'm going to make the mongoc_log_and_monitor_serial_t a bson_oid_t so we can use the centralized copy of this technically-undefined behavior that exists in _bson_context_set_oid_seq64.

Micah Scott added 4 commits February 3, 2025 14:22
This reverts 'opening' back to a single boolean flag, and stops trying to preserve the old behavior where unset callbacks would defer the apparent beginning of server monitoring. Feedback is that the old behavior was unintended, and it's better not to attempt to retain it.
@ghost
Copy link
Author

ghost commented Feb 4, 2025

Kevin and I discussed the changes to lazy-opening. The old behavior where server open could be deferred due to an unset callback is not necessarily intended, and not worth preserving. Without this requirement, we can avoid making lazy-opening more complicated. Now version_id is gone, and 'opened' is back to a boolean flag.

@ghost
Copy link
Author

ghost commented Feb 4, 2025

Waiting to merge until I can get a clean CI run, which is currently blocked on 8.0.5-rc1

@ghost ghost merged commit a91d6f6 into mongodb:master Feb 5, 2025
43 checks passed
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants