-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Add SORT_STRICT option to array_unique() #7806
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
17e92f3
to
5375ac5
Compare
Ok, latest push should break ties in all situations where the values are |
3b66d29
to
9a8d8a4
Compare
|
||
case IS_OBJECT: | ||
{ | ||
// Start with a recursive comparison like for arrays, for consistency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why doesn't this only compare the object ID? That should be sufficient to establish an ordering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the object ID leads to nondeterministic behavior. In my experience, having an array be sorted slightly differently every time you run the program is the sort of thing that leads to headaches. Starting with a recursive comparison of attributes and properties is consistent with the way arrays are sorted, and the way that objects and arrays are compared with == (ie, that's the behavior of zend_std_compare_objects
). Comparing class names is an extra attempt to avoid non-determinism. The object ID is only used as a fallback if your objects really are identical in all other ways. Ie, the way I've implemented it here the order of the ==
objects will be nondeterministic, but all the other !=
objects will have a consistent order from run to run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(in latest patch added test cases to document predictable order of non-identical objects, dropped the class name comparison.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(and to be clear by "nondeterminism" I mean that adding a new object allocation somewhere in your program, or reordering some code which performs allocations, can cause a nonlocal change in the sort order elsewhere. I believe in PHP the object IDs are actually, strictly-speaking, deterministic given identical inputs to a program, but in practice in large programs it "feels" nondeterministic because of the effect of nonlocal changes.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends on what the goal here is. I don't think SORT_STRICT is particularly useful outside array_unique(). And array_unique() just needs some arbitrary order, so it should be the most efficient such order.
It may be worthwhile to accept SORT_STRICT in array_unique() only, to avoid exposing any particular strict ordering to userland -- it becomes an unobservable implementation detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I started out thinking about just array_unique
, but once you define it as a constant it seems to invite being used for all the different array methods which accept flags. And I can think of use cases where a strict ordering is useful, for example anywhere where string and numeric values are mixed but you don't want the "0"
, "0.0"
, and 0
keys to all be considered the same. (SORT_STRING
will conflate "0"
and 0
, and SORT_DEFAULT
and SORT_NUMERIC
will conflate "0"
and "0.0"
). Similarly if you want null
to be a valid value sorted distinct from ""
.
"Sort, but don't do any conversion on the keys" actually seems like a genuinely-useful feature.
$a = [ "0", "0.0", "", null, "", "0.0", "0" ];
$flags = 0; // Nothing we can put here will put all identical values together
asort($a, $flags);
var_dump($a);
I'm not convinced that the sort order for array_unique
is an "unobservable implementation detail" -- I think the actual sort order chosen is still revealed by the order of the keys in the returned array. I have to think about that some more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced that the sort order for array_unique is an "unobservable implementation detail" -- I think the actual sort order chosen is still revealed by the order of the keys in the returned array. I have to think about that some more.
It's unobservable.
php > var_export(array_unique([10,1,0,0,2]));
array (
0 => 10,
1 => 1,
2 => 0,
4 => 2,
)
https://www.php.net/array_unique
Note that keys are preserved. If multiple elements compare equal under the given flags, then the key and value of the first equal element will be retained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason I'd personally avoid comparing properties is because recursive comparisons would be a fatal error, and recursive object references are much, much more common than recursive array references. For example, trees (e.g. Abstract Syntax Trees) might have parents reference childs, and childs reference parent
(Or an object and the implementation of an object that handles details of serializing the object to a cache such as memcache)
Comparing two different MyObject instances would result in a fatal error from infinite recursion detection.
php > class MyObject { public $loader; public $other; }
php > class MyLoader { public $object; }
php > $object = new MyObject();
php > $object->loader = new MyLoader();
php > $object->loader->object = $object;
php > var_dump($object);
object(MyObject)#1 (2) {
["loader"]=>
object(MyLoader)#2 (1) {
["object"]=>
*RECURSION*
}
["other"]=>
NULL
}
ext/standard/array.c
Outdated
ZVAL_DEREF(z2); | ||
// The most important thing about this comparison mode is that the result | ||
// is 0 when zend_is_identical, and non-zero otherwise. | ||
if (zend_is_identical(z1, z2)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you're enumerating all cases below anyway, I'd drop this check and handle equality below as well. For arrays that would save one deep comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but it makes it harder to maintain the invariant, and less obvious if there's some subtle bug somewhere that causes the output of zend_is_identical and this function to diverge. It's easier (IMO) to determine by inspection that the only way a zero can be returned from this function is if zend_is_identical returns zero. If you're worried about performance I could probably exclude objects and arrays from this comparison -- arrays involve a call to zend_hash_compare anyway so the "obvious by inspection it can't return 0" thing isn't as obviously true there anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(avoided the deep comparison in latest patch, as requested.)
908f282
to
51ae411
Compare
Updated the patch to simplify and correct the comparison: we no longer use class name for objects; we avoid the redundant deep inspection for arrays; and we use ordered comparison of properties for objects. Added additional test cases to document the intended behavior, including the deterministic ordering of objects with non-identical properties. |
Provide a way to request the use of strict comparison (===) when using `array_unique`. SORT_STRICT does no type conversion: values of different types are ordered by their type. For array/object values which are == but not === we recursively compare the elements/properties. Arrays with identical elements will be identical themselves; but for objects with identical properties which are not identical we break the tie using the `spl_object_id()` of the objects.
51ae411
to
4c26ec5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coincidentally, I'd also been working on stable sorting lately and didn't remember your PR, found it again when searching past bugs/proposals.
https://github.com/TysonAndre/pecl-teds/blob/0.5.1/teds.c#L567-L657 had some differences in the approach
-
I'd agree with earlier comments. If this is only meant to be for array_unique, then it should be rejected for other sort functions, and possibly given a different name (ARRAY_FILTER_STRICT, ARRAY_FILTER_SORT_STRICT_ORDER, or something)
-
This should handle NAN. Operators involving NAN are always false in C/PHP. This would break the ordering of floats in arrays. (except
NAN !== NAN
is true)
E.g. if 2.0 < NAN
is false, and NAN < 2.0
is false, we may get the ordering 2.0, NAN, 2.0 and have duplicate values of 2.
php > var_export(1.0 <=> NAN);
1
php > var_export(NAN <=> NAN);
1
php > var_export(NAN <=> 1.0);
1
- Some object types make no sense to compare by property set, e.g. Closure or GMP have no properties to compare.
php > $x = gmp_init(1);
php > var_export((string)$x);
'1'
php > var_export(get_object_vars($x));
array (
)
php > var_export($x);
GMP::__set_state(array(
))
Maybe changing the second argument of array_unique
to int|Closure
would help (Closure(int $a, int $b): int
for comparator, like usort), so that users could specify custom comparators for objects.
- Not sure if there'd be enough interest in allowing this, but it seems to have use cases, e.g. for
array_unique($items, fn($a, $b) => $a->id <=> $b->id);
without creating an extra array of just ids and intersecting keys - (this may need extra checks to guard against side effects of functions during sorting and stop/recover if there's an exception?)
EDIT: This was apparently proposed in https://externals.io/message/70060 but not really followed up on.
- If this is exposed for
*sort
to end users, then1.5 < (int)2
should be true (3/2 < 4/2 would be intuitive), and compare(2, 2.0) != 0 should also be true. See comment belowt1 != t2
|
||
case IS_OBJECT: | ||
{ | ||
// Start with a recursive comparison like for arrays, for consistency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced that the sort order for array_unique is an "unobservable implementation detail" -- I think the actual sort order chosen is still revealed by the order of the keys in the returned array. I have to think about that some more.
It's unobservable.
php > var_export(array_unique([10,1,0,0,2]));
array (
0 => 10,
1 => 1,
2 => 0,
4 => 2,
)
https://www.php.net/array_unique
Note that keys are preserved. If multiple elements compare equal under the given flags, then the key and value of the first equal element will be retained.
return Z_DVAL_P(z1) > Z_DVAL_P(z2) ? 1 : -1; | ||
|
||
case IS_STRING: | ||
return zend_binary_strcmp( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: zend_binary_zval_strcmp(z1, z2) does the same thing and is a bit shorter to read
// (This is deliberately not using the user-defined `compare` handler, | ||
// nor is it using zend_std_compare_objects() because that uses | ||
// zend_compare when examining properties, not a strict comparison.) | ||
zend_object *zobj1 = Z_OBJ_P(z1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are planning on proposing a sort function: Should this compare class names as well? E.g. with strcmp before building and comparing object properties.
If not, document why not?
we no longer use class name for objects;
EDIT: document why not, I didn't see the reason why in discussion
// (Values of different types can't be identical.) | ||
int t1 = Z_TYPE_P(z1); | ||
int t2 = Z_TYPE_P(z2); | ||
if ( t1 != t2 ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if ( t1 != t2 ) { | |
if (t1 != t2) { |
int t1 = Z_TYPE_P(z1); | ||
int t2 = Z_TYPE_P(z2); | ||
if ( t1 != t2 ) { | ||
return (t1 > t2 ) ? 1 : -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is meant to be used for a sort exposed to end users, than having 2 < 1.5
(IS_LONG < IS_DOUBLE) seems unintuitive. (it's correct if it's only used for array_unique, though)
E.g. for [4/2, 3/2]
, that's the int 2 followed by float(1.5), since integer division results in integers when exact and floats otherwise.
https://github.com/TysonAndre/pecl-teds/blob/0.5.1/teds.c#L577-L618 handles some of the edge cases. Those edge cases are https://github.com/TysonAndre/pecl-teds/blob/0.5.1/tests/misc/stable_compare.phpt#L75-L86
|
||
case IS_OBJECT: | ||
{ | ||
// Start with a recursive comparison like for arrays, for consistency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason I'd personally avoid comparing properties is because recursive comparisons would be a fatal error, and recursive object references are much, much more common than recursive array references. For example, trees (e.g. Abstract Syntax Trees) might have parents reference childs, and childs reference parent
(Or an object and the implementation of an object that handles details of serializing the object to a cache such as memcache)
Comparing two different MyObject instances would result in a fatal error from infinite recursion detection.
php > class MyObject { public $loader; public $other; }
php > class MyLoader { public $object; }
php > $object = new MyObject();
php > $object->loader = new MyLoader();
php > $object->loader->object = $object;
php > var_dump($object);
object(MyObject)#1 (2) {
["loader"]=>
object(MyLoader)#2 (1) {
["object"]=>
*RECURSION*
}
["other"]=>
NULL
}
Definitely not suggesting switching approach or adding to scope, just making a note of an idea I've had for how php could expose object comparisons safely in arrays of objects - Implementing an extensible Comparator class seems doable but would require much more familiarity with internals to guard against comparison callbacks mutating parent objects/arrays while recursing on zvals, and possibly detecting infinite recursion (adding and decrementing reference counts with GC_ADDREF/GC_DELREF)
Related past/current RFCs tried to change objects themselves:
// Extensible internal class - user-defined classes can override compare, compareObjects, compareStrings
// $c = Closure::fromCallable(new ComparatorOrSubclass()); can be used to create closures to pass elsewhere, e.g. array_unique($values, $c);
// Allows tracking state and customizations while comparing.
class Comparator {
/** Compares two top-level values but not inner values, in an overridable way (e.g. to allow ReverseComparator, custom behavior only for top-level objects, etc.) */
public function __invoke(mixed $a, mixed $b): int {
}
/** Compares two objects. Can be extended to customize for specific use cases. Defaults to .... */
public function compareObjects(object $a, object $b): int {
}
/** Compares two strings. Defaults to same ordering as strcmp. */
public function compareStrings(string $a, string $b): int {
}
// I expect php to remove resources at some point.
} |
There has not been any recent activity in this PR. It will automatically be closed in 7 days if no further action is taken. |
Provide a way to request the use of strict comparison (===) when using
array_unique
.