CDRIVER-4500: SDAM structured log and unified test support #1842

ghost · 2025-01-29T02:57:45Z

Resolves CDRIVER-4500 (SDAM structured logs + updated unified tests)
Resolves CDRIVER-4758 (Prose test for heartbeat event order: /server_discovery_and_monitoring/prose/heartbeat)
Resolves CDRIVER-4137 (Missing server close events, now tested by /server_discovery_and_monitoring/unified)
Completes outstanding issues in the CDRIVER-3775 epic (libmongoc structured logging)

The central purpose of the patch is to add missing log events for heartbeats, topology, and server updates. Several other changes became necessary:

Refactored APM callbacks into a new "log and monitor instance"
- Previously, APM callbacks were duplicated in many places including each individual topology description instance. This moves both APM and Structured Log callbacks/settings to a new "log and monitor instance" (mongoc_log_and_monitor_instance_t) object with a clear internal interface. It's owned by mongoc_topology_t, and given as an explicit parameter to log and monitor functions.
- Existing documented thread safety rules about pool settings (at most one write, prior to the first client) are implemented uniformly for both structured logs and APM callbacks. Edit: For compatibility, previously allowed unsafe uses are still allowed, but they now report a new warning message as an unstructured log error.
Refined topology lifecycle events to match standard behavior
- Server 'opened' events are deferred until after the corresponding topology is 'opened', rather than until the APM callbacks have been set.
- Added missing end-of-life SDAM events. Before monitoring is closed, topology descriptions will now transition to an "Unknown" state as required. Server monitoring will be closed. Closure events are only emitted if the same log_and_monitor instance has seen a corresponding open.
Added test skips for failures corresponding to features that are currently out of scope:
- CMAP events, especially pool clearing on various error conditions
- Driver-generated connection IDs (optional in the spec, but some tests assume they are present)
- Thread entities in unified tests
Added new unified test runner features:
- Support for "closing" some entity types without deleting (For example, to inspect events logged when closing a client)
- SDAM APM events, and a new eventType filtering implementation
- New operations: recordTopologyDescription, assertTopologyType, waitForPrimaryChange
- Added a structured log filter stack (entity_map_log_filter_push/pop) used internally to refine the apparent behavior of waitForEvent
Added new structured log items:
- oid() for a plain ObjectID without hex representation
- topology_description_as_json() for plain topology descriptions not inside a topology
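
The lifecycle rules described above (server 'opened' deferred until the topology is 'opened', and closure events emitted only when a matching open was seen) can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not the driver's implementation: `lifecycle_model_t`, the helper names, the event labels, and the relative order of events on close are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_SERVERS 4
#define MAX_EVENTS 16

/* Hypothetical model of one log-and-monitor instance; not the driver's types. */
typedef struct {
   bool topology_opened;
   bool server_known[MAX_SERVERS];
   bool server_opened[MAX_SERVERS];
   const char *events[MAX_EVENTS];
   size_t n_events;
} lifecycle_model_t;

static void
emit (lifecycle_model_t *m, const char *name)
{
   assert (m->n_events < MAX_EVENTS);
   m->events[m->n_events++] = name;
}

/* Report 'opening' for every known server that has not been reported yet. */
static void
open_pending_servers (lifecycle_model_t *m)
{
   for (size_t i = 0; i < MAX_SERVERS; i++) {
      if (m->server_known[i] && !m->server_opened[i]) {
         m->server_opened[i] = true;
         emit (m, "serverOpening");
      }
   }
}

static void
add_server (lifecycle_model_t *m, size_t id)
{
   assert (id < MAX_SERVERS);
   m->server_known[id] = true;
   /* Deferred: nothing is reported until the topology itself is opened. */
   if (m->topology_opened) {
      open_pending_servers (m);
   }
}

static void
open_topology (lifecycle_model_t *m)
{
   if (!m->topology_opened) {
      m->topology_opened = true;
      emit (m, "topologyOpening");
      open_pending_servers (m);
   }
}

static void
close_topology (lifecycle_model_t *m)
{
   for (size_t i = 0; i < MAX_SERVERS; i++) {
      /* A close is only reported for a server whose open was reported. */
      if (m->server_opened[i]) {
         emit (m, "serverClosed");
      }
      m->server_opened[i] = false;
      m->server_known[i] = false;
   }
   if (m->topology_opened) {
      emit (m, "topologyDescriptionChanged"); /* assumed: transition to Unknown */
      emit (m, "topologyClosed");
      m->topology_opened = false;
   }
}

/* True when the event recorded at `index` has the given label. */
static bool
event_is (const lifecycle_model_t *m, size_t index, const char *name)
{
   return index < m->n_events && strcmp (m->events[index], name) == 0;
}
```

In this model, a server added before `open_topology` produces no event until the topology opens, and closing a model that was never opened produces no events at all.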

eramongodb

Thank you for the explanations and clarifications. Re-suggesting some changes to coincide with additional motivations outlined in responses below, but otherwise, LGTM.

Re the extra warnings you're reporting, is there any plan to include them in an evergreen check?

No, these are warnings I enable for local development (Clang's -Weverything + set of -Wno-* to disable irrelevant/undesirable warnings) to assist with drive-by code improvements and avoid expanding the set of possible warnings which might be observed by users (comparing a diff of total warnings for proposed changes against the base commit).

casting away const in an explicit conversion to (void*) seems to be a common enough pattern in the repository.

It is an unfortunately common pattern. The reason for the -Wcast-qual warning (which is emitted in spite of the explicit cast to void*) is to guard against the possibility of undefined behavior due to accessing a const-qualified object when the non-const void pointer is (expected to be) cast back to an object pointer. Recent code changes have been making efforts to improve the const-correctness of our interfaces even when type-erasure is involved (*_const functions).

Why do we need to disallow stateful filter functions?

The suggestion was based simply on the observation that no current usage of this pointer requires modifiable access to the pointed-to object: it is only being used for pointer equality comparison here and string comparison here. If your preference is to leave open the possibility for modifiable object pointers by using void*, that is fine.
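
To make the suggestion concrete, here is a minimal sketch of a type-erased filter interface that takes `const void *` user data, covering the two uses mentioned (string comparison and pointer equality). The names `event_filter_fn`, `filter_by_name`, `filter_by_identity`, and `count_matching` are hypothetical and are not the test runner's actual interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical filter signature: user data is read-only, so callers never
 * need to cast away const to pass a string literal or a const object. */
typedef bool (*event_filter_fn) (const char *event_name, const void *user_data);

/* Filter using string comparison against the user data. */
static bool
filter_by_name (const char *event_name, const void *user_data)
{
   const char *wanted = user_data; /* const void* converts implicitly in C */
   return strcmp (event_name, wanted) == 0;
}

/* Filter using pointer equality only; the pointee is never dereferenced. */
static bool
filter_by_identity (const char *event_name, const void *user_data)
{
   return (const void *) event_name == user_data;
}

/* Count the events accepted by `filter`. */
static size_t
count_matching (const char *const *events, size_t n, event_filter_fn filter, const void *user_data)
{
   assert (events && filter);
   size_t count = 0;
   for (size_t i = 0; i < n; i++) {
      if (filter (events[i], user_data)) {
         count++;
      }
   }
   return count;
}
```

A stateful filter would need `void *` instead, which is the trade-off raised in the question above.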

My preference was to avoid the "unsafe" version unless there was a reason to prefer it.

Suggestion was not for performance reasons, but instead to communicate that synchronization is not required in this context (there should only be one thread accessing the to-be-destroyed topology object). That is, TSAN should rightfully warn if there are multiple threads accessing (un)guarded resources at this point.

I declined to document apm_mutex in additional detail here because I'm not convinced the current implementation is correct. [...] For this patch, I tried to preserve the existing behavior without changing it.

The proposed rationale was echoing the one given at the time the mutex was introduced (see: #607 (comment)).

My guess is that, at the time, thread-safe usage of the C Driver API may not have been an actively supported use case, therefore APM callback invocations were assumed to be always single-threaded (from the perspective of a user who registered callbacks) until the introduction of background server monitoring threads. This would explain why no other APM callback invocations are guarded.

This is for skipping events of other event types, when the "eventType" parameter is being used. When there's no "eventType", this never skips events.

I overlooked the specified behavior of eventType event checks. Thank you.
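
For readers following along, the skipping rule as stated in the quoted explanation can be sketched like this. It models only that statement (skip events of other types when an "eventType" is given, skip nothing otherwise); the type and function names are hypothetical and this is not the runner's real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { EVENT_TYPE_COMMAND, EVENT_TYPE_CMAP, EVENT_TYPE_SDAM } event_type_t;

/* Hypothetical captured event: a type tag plus a display name. */
typedef struct {
   event_type_t type;
   const char *name;
} event_t;

/* With no "eventType" (has_event_type == false) nothing is skipped.
 * With one, only events of a different type are skipped. */
static bool
should_skip_event (const event_t *event, bool has_event_type, event_type_t wanted)
{
   return has_event_type && event->type != wanted;
}

/* Count the events that would actually be compared against expectations. */
static size_t
count_checked_events (const event_t *events, size_t n, bool has_event_type, event_type_t wanted)
{
   assert (events || n == 0);
   size_t checked = 0;
   for (size_t i = 0; i < n; i++) {
      if (!should_skip_event (&events[i], has_event_type, wanted)) {
         checked++;
      }
   }
   return checked;
}
```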

The skipping of cmap events would still be required.

It's been documented in the manpage.

Sorry, I was referencing outdated docs (generated locally). I see the Thread Safety section on the live website.

It's a workaround for allowing applications to see events which would otherwise be emitted earlier than the APM callbacks could be attached.

Thank you for the explanation.

It's not clear to me that an ASSERT here would help. [...] I don't think it would be an improvement to guarantee a crash after opening 2^64 topologies per process.

The proposed assertion is not to prevent collisions, but to avoid undefined behavior due to signed integer overflow during the atomic increment. The assertion would guard against undefined behavior by detecting the soon-to-occur condition and triggering well-defined erroneous behavior instead.

This implies that we could switch to using type-specific event lists, instead of having one event list that we filter.

I am fine with deferring this potential improvement to a separate PR.

eramongodb · 2025-01-30T16:21:13Z

src/libmongoc/src/mongoc/mongoc-topology.c

+   /* Before reporting this topology as closed, life cycle rules expect us to close
+    * all servers and transition to an unknown topology. */
+   {
+      mc_shared_tpld td = mc_tpld_take_ref (topology);


Consider suggestion once more, not for performance, but to communicate thread safety expectations within this context (synchronization "should not" be necessary here; allow TSAN to detect violation).

eramongodb · 2025-01-30T16:22:13Z

src/libmongoc/src/mongoc/mongoc-log-and-monitor-private.c

+   static int64_t serial_number_atomic = 0;
+   mongoc_log_and_monitor_serial_t result =
+      1 + mcommon_atomic_int64_fetch_add (&serial_number_atomic, 1, mcommon_memory_order_seq_cst);


Consider suggestion once more, not to prevent collisions, but to avoid undefined behavior due to signed integer overflow (during the atomic increment) by triggering well-defined erroneous behavior (assertion) instead before it can occur.
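
A minimal sketch of what such a guard could look like, assuming C11 `<stdatomic.h>` as a stand-in for the driver's `mcommon_atomic_*` helpers. `next_serial` and `serial_can_increment` are hypothetical names, and the check is placed on the value returned by the fetch-add, before 1 is added to it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the serial number counter in the diff above. */
static _Atomic int64_t serial_number_atomic = 0;

/* True when `previous + 1` is still representable as int64_t. */
static bool
serial_can_increment (int64_t previous)
{
   return previous < INT64_MAX;
}

/* Returns 1, 2, 3, ... and asserts instead of overflowing. */
static int64_t
next_serial (void)
{
   int64_t previous = atomic_fetch_add (&serial_number_atomic, INT64_C (1));
   /* Fail loudly before the signed addition below could overflow. */
   assert (serial_can_increment (previous));
   return INT64_C (1) + previous;
}
```

Note that a plain `assert` is compiled out under `NDEBUG`, so a real guard would need an always-on assertion macro.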

ghost · 2025-01-30T20:00:59Z

On Thu, Jan 30, 2025 at 8:24 AM Ezra Chung ***@***.***> wrote:

> > My preference was to avoid the "unsafe" version unless there was a reason to prefer it.
>
> Suggestion was not for performance reasons, but instead to communicate that synchronization is not required in this context (there should only be one thread accessing the to-be-destroyed topology object). That is, TSAN should rightfully warn if there are multiple threads accessing (un)guarded resources at this point.

Ahh, gotcha. Good idea. I'll make a note of that in the comments too.

> > I declined to document apm_mutex in additional detail here because I'm not convinced the current implementation is correct. [...] For this patch, I tried to preserve the existing behavior without changing it.
>
> The proposed rationale was echoing the one given at the time the mutex was introduced (see: #607 (comment)).
>
> My guess is that, at the time, thread-safe usage of the C Driver API may not have been an actively supported use case, therefore APM callback invocations were assumed to be always single-threaded (from the perspective of a user who registered callbacks) until the introduction of background server monitoring threads. This would explain why no other APM callback invocations are guarded.

Ok. I'll add an explanation that references the original motive plus the potential pitfall.

> > It's not clear to me that an ASSERT here would help. [...] I don't think it would be an improvement to guarantee a crash after opening 2^64 topologies per process.
>
> The proposed assertion is not to prevent collisions, but to avoid undefined behavior due to signed integer overflow during the atomic increment. The assertion would guard against undefined behavior by detecting the soon-to-occur condition and triggering well-defined erroneous behavior instead.

I understand, I'm just not sure it's an improvement. I'll think about a better solution. My point of view is that UB in this case is going to be theoretical only, and we are trading a higher level of purity with respect to the C spec for an actual runtime fault. In any case this is far into the weeds with unlikely events. The incrementing counter could be exchanged with any other locally unique identifier.

> > This implies that we could switch to using type-specific event lists, instead of having one event list that we filter.
>
> I am fine with deferring this potential improvement to a separate PR.

Sure. I'm not sure it's necessary to prioritize this; the unified test runner could really use a more thorough overhaul.

Thanks,
--micah

ghost · 2025-01-30T22:06:51Z

I'm going to make the mongoc_log_and_monitor_serial_t a bson_oid_t so we can use the centralized copy of this technically-undefined behavior that exists in _bson_context_set_oid_seq64.

src/libmongoc/CMakeLists.txt

src/libmongoc/src/mongoc/mongoc-log-and-monitor-private.h

src/libmongoc/src/mongoc/mongoc-topology.c

src/libmongoc/src/mongoc/mongoc-topology-description-private.h

This reverts 'opening' back to a single boolean flag, and stops trying to preserve the old behavior where unset callbacks would defer the apparent beginning of server monitoring. Feedback is that the old behavior was unintended, and it's better not to attempt to retain it.

ghost · 2025-02-04T17:02:16Z

Kevin and I discussed the changes to lazy-opening. The old behavior where server open could be deferred due to an unset callback is not necessarily intended, and not worth preserving. Without this requirement, we can avoid making lazy-opening more complicated. Now version_id is gone, and 'opened' is back to a boolean flag.

…on_add_server

ghost · 2025-02-04T19:36:58Z

Waiting to merge until I can get a clean CI run, which is currently blocked on 8.0.5-rc1

Micah Scott added 30 commits January 28, 2025 14:58

sync SDAM spec tests

25d922c

from commit d795d493c41022cb8ed15006ae5ac5ad85936f40

Add server_discovery_and_monitoring/unified tests

133c251

Add skips for SDAM unified tests that require pool support

4d25c8b

unified: error instead of segfault when checking events/logs on missi…

f149d2d

…ng client entity

unified entities: make 'close' distinct from 'delete'

e518e64

To support SDAM tests, we need to access entities that have been closed.

log-and-monitor type for apm monitoring state plus structured log ins…

ccceffe

…tance This replaces a few duplicated copies of the APM callbacks with a centralized copy maintained inside topology_t by mongoc_log_and_monitor_instance_t.

structured_log: add 'oid' item

183fa8d

We had oid_as_hex, but this adds a plain bson ObjectId representation.

structured log: "Starting topology monitoring"

bede749

Debug logging about suppressed logs in unified tests

068f9ff

td logging: monitor_changed even when apm monitoring callbacks unset

e66620f

structured log items for topology description from topology or standa…

05cf0ae

…lone

structured log: "Topology description changed"

9f8fa6e

unified tests: improve error logging about structured logs

988946b

structured log: "Starting server monitoring"

c9b5f20

monitoring: reset 'opened' state when instance changes

8633fe5

structured log: "Stopped server monitoring"

6073776

structured log: "Stopped topology monitoring"

43215fe

unified test runner: heartbeat started/succeeded/failed events

9f136fd

entity-map: pass server monitoring mode URI option

7dd15d9

unified tests: runCommand should report reply as result for matching

ce7bf62

unified tests: allow sdam eventType

1016f19

server-monitor heartbeat structured log messages

415d3a7

heartbeat structured logs for topology scanner

7b2abe4

waitForEvent: do an entire stream-selection, to advance blocking topo…

f4b2fd5

…logy scans

I think we need to skip serverMonitoringMode unified tests until we g…

dafbf39

…et pool support

unified waitForEvent: Do not skip structured logs from synthetic stre…

e99cd54

…am selection

Skip heartbeat logging tests that depend on CMAP connection ID

2db0f72

unified runner: topology open/close APM events

86be66f

Several more skips for tests that depend on CMAP features

1bb0ae6

Comment clarification

5f94175

mdbmes and others added 8 commits January 29, 2025 17:40

suggested comment

722475c

Co-authored-by: Ezra Chung <[email protected]>

suggested "static" keyword for test function

f69df45

Co-authored-by: Ezra Chung <[email protected]>

suggested INT64_C macros

64861c6

Co-authored-by: Ezra Chung <[email protected]>

fix NEWS entry (#1844)

6c70af7

Relocate deprecation notice to upcoming 1.30.0 release.

suggested additional whitespace for skipped tests list

be245b4

Fix documented param name

87fbfab

Move up comment about closed entities

a5162c1

Additional comment requested in _mongoc_topology_description_monitor_…

34b647b

…closed

ghost requested a review from eramongodb January 30, 2025 02:27

eramongodb approved these changes Jan 30, 2025

Micah Scott added 6 commits January 30, 2025 14:50

Prefer unlocked td access during topology destroy

d658671

Allow unsafe mongoc_client_pool_set_apm_callbacks usage

8d86d15

Replace mongoc_log_and_monitor_serial_t with bson_oid_t

10a3bd2

Expanded comment about apm_mutex

87e953d

Merge branch 'master' into CDRIVER-4500

494b888

Additional comment explaining the SDAM lifecycle workaround

91ccb6a

kevinAlbs reviewed Feb 3, 2025

Micah Scott added 4 commits February 3, 2025 14:22

Private header shouldn't be in HEADERS

1300364

const parameter for mongoc_log_and_monitor_instance_set_apm_callbacks

a7404ca

Add NEWS about topology event changes

fcc1446

Micah Scott added 2 commits February 4, 2025 09:19

Defer server opening until after topology is opened

ec2009c

Update comment about lazy server opening in mongoc_topology_descripti…

ec86c49

…on_add_server

kevinAlbs approved these changes Feb 4, 2025

ghost merged commit a91d6f6 into mongodb:master Feb 5, 2025
43 checks passed

eramongodb mentioned this pull request Feb 12, 2025

CXX-3208 update SDAM monitoring tests following mongo-c-driver a91d6f6a mongodb/mongo-cxx-driver#1332

Merged

This pull request was closed.
