Use HT for recursion protection in JSON encode #7589

Closed
wants to merge 1 commit into from

nikic
Member

nikic commented Oct 18, 2021

The jsonSerialize() method might access recursion-protected objects/arrays, in which case other operations like var_dump() may break. This also fixes https://bugs.php.net/bug.php?id=81524.

@tstarling Thoughts?

@tstarling
Contributor

The instruction count increases by about 40% for a tight loop, from 350 to 493 instructions per element. GC_PROTECTED is fast and will be hard to beat, but 40% seems like a lot.

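// Build an array of 1000 single-element arrays.
$a = [];
for ($i = 0; $i < 1000; $i++) {
	$a[] = [$i];
}
// Encode it repeatedly so the per-element cost dominates.
for ($i = 0; $i < 100000; $i++) {
	json_encode($a);
}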

@nikic
Member Author

nikic commented Oct 19, 2021

Yeah, the overhead here is non-trivial. Don't really see how to improve on it though.

I think the two alternatives would be a) simply ignore any weird interactions between different recursion protections, since it rarely matters, or b) switch to a two-level recursion protection: the first user can make use of the GC_PROTECTED flag, while a nested second user would fall back to an HT instead. This would ensure both that there is little performance impact in the average case and that there is no interference between different recursers. It would be a more intrusive change, though, since it needs to cover other users of GC_PROTECTED as well.
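Option b) could look roughly like the following standalone sketch. It is not php-src code: `node`, `guard`, and the `flag_owner_active` global are hypothetical names, and the rule "whoever starts first owns the flag bit" is an assumption about how the two levels would be told apart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted value with one "protected" bit. */
typedef struct node {
    bool protected_flag;
} node;

#define MAX_TRACKED 64

/* The outermost guard uses the flag bit; a nested guard keeps its own set. */
typedef struct guard {
    bool uses_flag;
    const node *seen[MAX_TRACKED]; /* fallback set, linear scan for brevity */
    size_t seen_count;
} guard;

static int flag_owner_active = 0; /* hypothetical: is the flag bit claimed? */

static void guard_begin(guard *g) {
    g->uses_flag = (flag_owner_active == 0);
    g->seen_count = 0;
    if (g->uses_flag) {
        flag_owner_active = 1;
    }
}

static void guard_end(guard *g) {
    if (g->uses_flag) {
        flag_owner_active = 0;
    }
}

/* Returns false if n is already being visited by this guard. */
static bool guard_enter(guard *g, node *n) {
    if (g->uses_flag) {
        if (n->protected_flag) {
            return false;
        }
        n->protected_flag = true;
        return true;
    }
    for (size_t i = 0; i < g->seen_count; i++) {
        if (g->seen[i] == n) {
            return false;
        }
    }
    assert(g->seen_count < MAX_TRACKED);
    g->seen[g->seen_count++] = n;
    return true;
}

static void guard_leave(guard *g, node *n) {
    if (g->uses_flag) {
        n->protected_flag = false;
        return;
    }
    /* This sketch assumes nested guards leave in LIFO order. */
    assert(g->seen_count > 0 && g->seen[g->seen_count - 1] == n);
    g->seen_count--;
}
```

The nested guard never reads or writes the flag bit, so it neither trips over the outer guard's marks nor clears them.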

@tstarling
Contributor

Object handles could be used as an index into an array of bits -- you could realloc a global array as more handles appear. The relative memory overhead would only be 1/8/sizeof(zend_object) = 0.2%. Doesn't help with arrays, and there's that TODO in the code about removing zend_object.handle, but maybe it could solve the bug at least.
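A minimal sketch of the handle-indexed bit array idea, assuming handles are small unsigned integers. The `handle_bits` type and its functions are hypothetical and not part of the Zend API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Growable bit array: one bit per object handle. */
typedef struct handle_bits {
    uint8_t *bits;
    size_t nbytes;
} handle_bits;

/* Grow (zero-filled) until `handle` has a bit. Returns false on OOM. */
static bool handle_bits_reserve(handle_bits *hb, uint32_t handle) {
    size_t needed = (size_t)handle / 8 + 1;
    if (needed <= hb->nbytes) {
        return true;
    }
    size_t new_size = hb->nbytes ? hb->nbytes : 16;
    while (new_size < needed) {
        new_size *= 2;
    }
    uint8_t *p = realloc(hb->bits, new_size);
    if (p == NULL) {
        return false;
    }
    memset(p + hb->nbytes, 0, new_size - hb->nbytes);
    hb->bits = p;
    hb->nbytes = new_size;
    return true;
}

static bool handle_bits_test(const handle_bits *hb, uint32_t handle) {
    size_t byte = (size_t)handle / 8;
    if (byte >= hb->nbytes) {
        return false;
    }
    return ((hb->bits[byte] >> (handle % 8)) & 1u) != 0;
}

/* Returns 1 if newly protected, 0 if already protected, -1 on OOM. */
static int handle_bits_protect(handle_bits *hb, uint32_t handle) {
    if (!handle_bits_reserve(hb, handle)) {
        return -1;
    }
    if (handle_bits_test(hb, handle)) {
        return 0;
    }
    hb->bits[handle / 8] |= (uint8_t)(1u << (handle % 8));
    return 1;
}

static void handle_bits_unprotect(handle_bits *hb, uint32_t handle) {
    assert((size_t)handle / 8 < hb->nbytes);
    hb->bits[handle / 8] &= (uint8_t)~(1u << (handle % 8));
}
```

Doubling the allocation keeps the amortized cost of growth constant per new handle, at one bit of storage per object.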

@bukka
Member

bukka commented Nov 1, 2021

Since there's already a JSON-specific constant in the Zend code (ZEND_PROP_PURPOSE_JSON), I think it might be easier to also have a special protection for JSON. Basically, take the patch from @tstarling attached to the bug and rename GC_PROTECTED2 to something like GC_PROTECTED_JSON, which would be used only by the JSON extension.

@nikic
Member Author

nikic commented Nov 3, 2021

@bukka That would require reserving an additional GC bit. I don't think that makes sense for a JSON-only use case.

In any case, closing this one as the overhead is too large.

if (GC_FLAGS(rc) & GC_IMMUTABLE) {
	return SUCCESS;
}
if (zend_hash_index_add_empty_element(&encoder->recursive, (uintptr_t) rc)) {
Contributor

TysonAndre commented Oct 23, 2022

Leaving a note if anyone bases functionality on this in the future or restores this:

As I discovered in #7690, using pointers directly as hash indexes leads to a lot of hash collisions and performance issues. Shifting by ZEND_MM_ALIGNED_OFFSET_LOG2 instead helps noticeably (and works with both malloc and emalloc).

(E.g. if 44-byte zend_array instances (on 64-bit platforms) are aligned to 16 bytes in practice with emalloc on a platform (the low bit of a pointer is the byte address), then the low 4 bits of every pointer are zero and the pointers can only ever land in 1 of every 16 hash buckets.)
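The effect can be seen with synthetic addresses. The sketch below assumes 16-byte alignment and a power-of-two table indexed by the low bits of the key. It is a simplification of how a real hash table picks buckets, and `distinct_buckets` and `ALIGN_LOG2` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGN_LOG2 4 /* assumption: 16-byte alignment, as in the example above */

/* Count how many buckets are hit by `count` synthetic addresses spaced
 * 16 bytes apart. table_size must be a power of two, at most 1024. */
static size_t distinct_buckets(size_t count, size_t table_size, unsigned shift) {
    unsigned char used[1024] = {0};
    size_t distinct = 0;
    assert(table_size > 0 && table_size <= 1024);
    assert((table_size & (table_size - 1)) == 0);
    for (size_t i = 0; i < count; i++) {
        uintptr_t addr = (uintptr_t)0x10000 + ((uintptr_t)i << ALIGN_LOG2);
        size_t bucket = (size_t)((addr >> shift) & (table_size - 1));
        if (!used[bucket]) {
            used[bucket] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

With 1024 addresses and 1024 buckets, `distinct_buckets(1024, 1024, 0)` uses only 64 buckets, while `distinct_buckets(1024, 1024, ALIGN_LOG2)` uses all 1024.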

if (GC_FLAGS(rc) & GC_IMMUTABLE) {
	return;
}
GC_DELREF(rc);
Contributor

So, the original code was increasing the reference count of the properties table, and decreasing the reference count of the properties table, to prevent the property table from getting freed during iteration.

Now that we're referencing the object, your PR is calling GC_ADDREF and GC_DELREF to prevent the object from getting freed during iteration.

If this PR were to be reopened, or if someone were to base code on this in the future, would it need to check whether GC_DELREF returns 0 and free the array/object in question, to avoid leaks if jsonSerialize removed the last reference to a value as a side effect? I.e. call rc_dtor_func if the count is unexpectedly 0.

ZEND_API void ZEND_FASTCALL rc_dtor_func(zend_refcounted *p)
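The pattern being asked about, as a toy model outside php-src. `rc_value`, `protected_visit`, and the other names are hypothetical; the increment and decrement stand in for GC_ADDREF and GC_DELREF, and `rc_value_destroy` stands in for rc_dtor_func.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical toy refcounted value. */
typedef struct rc_value {
    unsigned refcount;
} rc_value;

static int destroyed_count = 0; /* lets a caller observe destruction */

static void rc_value_destroy(rc_value *v) {
    destroyed_count++;
    free(v);
}

static rc_value *rc_value_new(void) {
    rc_value *v = malloc(sizeof *v);
    assert(v != NULL);
    v->refcount = 1;
    return v;
}

/* An ordinary owner dropping its reference. */
static void rc_value_release(rc_value *v) {
    if (--v->refcount == 0) {
        rc_value_destroy(v);
    }
}

typedef void (*visit_fn)(rc_value *v);

/* Hold a temporary reference across a callback that may drop others. */
static void protected_visit(rc_value *v, visit_fn fn) {
    v->refcount++; /* stands in for GC_ADDREF */
    fn(v);
    if (--v->refcount == 0) { /* stands in for checking GC_DELREF's result */
        rc_value_destroy(v);  /* we held the last reference: free it here */
    }
}

/* Callback that removes the last outside reference as a side effect. */
static void drop_last_reference(rc_value *v) {
    rc_value_release(v);
}

/* Callback that leaves the reference count alone. */
static void keep_reference(rc_value *v) {
    (void)v;
}
```

Without the zero check in `protected_visit`, a value whose last outside reference is dropped by the callback would never be destroyed.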

@TysonAndre
Contributor

TysonAndre commented Oct 23, 2022


A brand new type for unordered sets of non-null (non-0) pointers (only supporting adding, removal, and membership checks) might have even better performance (only for internal use within json.h, not an exported api).

Something along the lines of what is used in https://github.com/igbinary/igbinary/blob/master/src/php7/hash_si_ptr.c

  • For a hash table of pointers/zend_long that only needs addition, membership check, and removal, there's no need for allocating memory for buckets as 16-byte zvals

  • There doesn't even need to be an associated 16-byte zval in this case

  • If values are always added in order and removed in the opposite order, I think there's no need to mark buckets as wasted. They can be reset to 0/null instead

    That assumption is sadly almost definitely wrong because of Fibers, though: JsonSerializable::jsonSerialize can switch to a different fiber, that fiber can start a call to json_encode, and that call can switch back to the original one.
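A rough sketch of such a pointer set, with no zval storage and no tombstones. This is not the igbinary code: it uses a fixed capacity, linear probing, and backward-shift deletion (a technique swapped in here so that removal works in any order, not only LIFO), and every name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTR_SET_CAPACITY 64 /* power of two; fixed for brevity, no resizing */
#define PTR_SET_MASK (PTR_SET_CAPACITY - 1)

typedef struct ptr_set {
    const void *slots[PTR_SET_CAPACITY]; /* NULL marks an empty slot */
    size_t used;
} ptr_set;

/* Home slot: drop the low 4 bits, assuming 16-byte aligned pointers. */
static size_t ptr_set_home(const void *p) {
    return (size_t)(((uintptr_t)p >> 4) & PTR_SET_MASK);
}

static bool ptr_set_contains(const ptr_set *s, const void *p) {
    size_t i = ptr_set_home(p);
    while (s->slots[i] != NULL) {
        if (s->slots[i] == p) {
            return true;
        }
        i = (i + 1) & PTR_SET_MASK;
    }
    return false;
}

/* Returns false if p was already present. */
static bool ptr_set_add(ptr_set *s, const void *p) {
    assert(p != NULL && s->used < PTR_SET_CAPACITY / 2);
    size_t i = ptr_set_home(p);
    while (s->slots[i] != NULL) {
        if (s->slots[i] == p) {
            return false;
        }
        i = (i + 1) & PTR_SET_MASK;
    }
    s->slots[i] = p;
    s->used++;
    return true;
}

/* Returns false if p was not present. Later entries of the probe run are
 * shifted back into the hole, so no "wasted" markers are needed. */
static bool ptr_set_remove(ptr_set *s, const void *p) {
    assert(p != NULL);
    size_t i = ptr_set_home(p);
    while (s->slots[i] != p) {
        if (s->slots[i] == NULL) {
            return false;
        }
        i = (i + 1) & PTR_SET_MASK;
    }
    size_t j = i;
    for (;;) {
        j = (j + 1) & PTR_SET_MASK;
        if (s->slots[j] == NULL) {
            break;
        }
        size_t home = ptr_set_home(s->slots[j]);
        /* Move the entry back only if its probe path passes through i. */
        if (((j - home) & PTR_SET_MASK) >= ((j - i) & PTR_SET_MASK)) {
            s->slots[i] = s->slots[j];
            i = j;
        }
    }
    s->slots[i] = NULL;
    s->used--;
    return true;
}
```

Each slot is a single pointer, so the table costs 8 bytes per slot on a 64-bit platform.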

@@ -26,11 +26,20 @@ struct _php_json_encoder {
	int depth;
	int max_depth;
	php_json_error_code error_code;
	HashTable recursive;
Contributor

Separately from the previous comment, there's the question of whether the JSON recursion protection should be per-request (in request globals, set up and torn down in RINIT/RSHUTDOWN) rather than per call to json_encode (in the encoder struct).

My preference is for the former

E.g. a JsonSerializable::jsonSerialize implementation calling json_encode($this) would trigger infinite recursion with this PR (but not before this PR) if each call to json_encode had a distinct recursive hash table instance - since object property tables don't really change to different pointers in practice if it's necessary for json_encode to work.
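The difference can be modelled in a few lines. In the toy below, `encode` re-enters itself on the same object to imitate jsonSerialize calling json_encode($this). Passing a shared `visited` set models per-request protection, and passing NULL models a fresh set per call. All names are hypothetical, and the depth limit exists only so the unprotected case terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Tiny visited set with linear scan; enough for the illustration. */
typedef struct visited {
    const void *items[32];
    size_t count;
} visited;

/* Returns false if p is already in the set. */
static bool visited_push(visited *v, const void *p) {
    for (size_t i = 0; i < v->count; i++) {
        if (v->items[i] == p) {
            return false;
        }
    }
    assert(v->count < 32);
    v->items[v->count++] = p;
    return true;
}

static void visited_pop(visited *v) {
    assert(v->count > 0);
    v->count--;
}

typedef struct object {
    int id;
} object;

/* Returns -1 when recursion is detected, otherwise the depth at which the
 * toy gave up (max_depth). */
static int encode(const object *obj, visited *shared, int depth, int max_depth) {
    visited local = { {0}, 0 };
    visited *v = shared ? shared : &local; /* per-request vs. per-call set */
    if (depth >= max_depth) {
        return depth; /* runaway recursion cut off by the toy's limit */
    }
    if (!visited_push(v, obj)) {
        return -1; /* recursion detected */
    }
    /* Simulated serialization hook that re-enters encode on the same object. */
    int result = encode(obj, shared, depth + 1, max_depth);
    visited_pop(v);
    return result;
}
```

With a shared set the second entry is rejected immediately. With per-call sets each nested call starts empty, so the recursion runs until the depth limit.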

@bukka
Copy link
Member

bukka commented Aug 26, 2023

Just for the record this was addressed by 53aa53f
