Skip to content

Add SORT_STRICT option to array_unique() #7806

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

cscott
Copy link
Contributor

@cscott cscott commented Dec 21, 2021

Provide a way to request the use of strict comparison (===) when using array_unique.

@cscott cscott force-pushed the array_unique_strict branch from 17e92f3 to 5375ac5 Compare December 22, 2021 20:24
@cscott
Copy link
Contributor Author

cscott commented Dec 22, 2021

Ok, latest push should break ties in all situations where the values are == but not ===.

@cscott cscott force-pushed the array_unique_strict branch 2 times, most recently from 3b66d29 to 9a8d8a4 Compare December 22, 2021 22:09

case IS_OBJECT:
{
// Start with a recursive comparison like for arrays, for consistency.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't this only compare the object ID? That should be sufficient to establish an ordering.

Copy link
Contributor Author

@cscott cscott Dec 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the object ID leads to nondeterministic behavior. In my experience, having an array be sorted slightly differently every time you run the program is the sort of thing that leads to headaches. Starting with a recursive comparison of attributes and properties is consistent with the way arrays are sorted, and the way that objects and arrays are compared with == (ie, that's the behavior of zend_std_compare_objects). Comparing class names is an extra attempt to avoid non-determinism. The object ID is only used as a fallback if your objects really are identical in all other ways. Ie, the way I've implemented it here the order of the == objects will be nondeterministic, but all the other != objects will have a consistent order from run to run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(in latest patch added test cases to document predictable order of non-identical objects, dropped the class name comparison.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and to be clear by "nondeterminism" I mean that adding a new object allocation somewhere in your program, or reordering some code which performs allocations, can cause a nonlocal change in the sort order elsewhere. I believe in PHP the object IDs are actually, strictly-speaking, deterministic given identical inputs to a program, but in practice in large programs it "feels" nondeterministic because of the effect of nonlocal changes.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends on what the goal here is. I don't think SORT_STRICT is particularly useful outside array_unique(). And array_unique() just needs some arbitrary order, so it should be the most efficient such order.

It may be worthwhile to accept SORT_STRICT in array_unique() only, to avoid exposing any particular strict ordering to userland -- it becomes an unobservable implementation detail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I started out thinking about just array_unique, but once you define it as a constant it seems to invite being used for all the different array methods which accept flags. And I can think of use cases where a strict ordering is useful, for example anywhere where string and numeric values are mixed but you don't want the "0", "0.0", and 0 keys to all be considered the same. (SORT_STRING will conflate "0" and 0, and SORT_DEFAULT and SORT_NUMERIC will conflate "0" and "0.0"). Similarly if you want null to be a valid value sorted distinct from "".
"Sort, but don't do any conversion on the keys" actually seems like a genuinely-useful feature.

$a = [ "0", "0.0", "", null, "", "0.0", "0" ];
$flags = 0; // Nothing we can put here will put all identical values together
asort($a, $flags);
var_dump($a);

I'm not convinced that the sort order for array_unique is an "unobservable implementation detail" -- I think the actual sort order chosen is still revealed by the order of the keys in the returned array. I have to think about that some more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that the sort order for array_unique is an "unobservable implementation detail" -- I think the actual sort order chosen is still revealed by the order of the keys in the returned array. I have to think about that some more.

It's unobservable.

php > var_export(array_unique([10,1,0,0,2]));
array (
  0 => 10,
  1 => 1,
  2 => 0,
  4 => 2,
)

https://www.php.net/array_unique

Note that keys are preserved. If multiple elements compare equal under the given flags, then the key and value of the first equal element will be retained.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another reason I'd personally avoid comparing properties is because recursive comparisons would be a fatal error, and recursive object references are much, much more common than recursive array references. For example, trees (e.g. Abstract Syntax Trees) might have parents reference childs, and childs reference parent

(Or an object and the implementation of an object that handles details of serializing the object to a cache such as memcache)

Comparing two different MyObject instances would result in a fatal error from infinite recursion detection.

php > class MyObject { public $loader; public $other; }
php > class MyLoader { public $object; }
php > $object = new MyObject();
php > $object->loader = new MyLoader();
php > $object->loader->object = $object;
php > var_dump($object);
object(MyObject)#1 (2) {
  ["loader"]=>
  object(MyLoader)#2 (1) {
    ["object"]=>
    *RECURSION*
  }
  ["other"]=>
  NULL
}

ZVAL_DEREF(z2);
// The most important thing about this comparison mode is that the result
// is 0 when zend_is_identical, and non-zero otherwise.
if (zend_is_identical(z1, z2)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you're enumerating all cases below anyway, I'd drop this check and handle equality below as well. For arrays that would save one deep comparison.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but it makes it harder to maintain the invariant, and less obvious if there's some subtle bug somewhere that causes the output of zend_is_identical and this function to diverge. It's easier (IMO) to determine by inspection that the only way a zero can be returned from this function is if zend_is_identical returns zero. If you're worried about performance I could probably exclude objects and arrays from this comparison -- arrays involve a call to zend_hash_compare anyway so the "obvious by inspection it can't return 0" thing isn't as obviously true there anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(avoided the deep comparison in latest patch, as requested.)

@cscott cscott force-pushed the array_unique_strict branch 2 times, most recently from 908f282 to 51ae411 Compare December 23, 2021 16:43
@cscott
Copy link
Contributor Author

cscott commented Dec 23, 2021

Updated the patch to simplify and correct the comparison: we no longer use class name for objects; we avoid the redundant deep inspection for arrays; and we use ordered comparison of properties for objects. Added additional test cases to document the intended behavior, including the deterministic ordering of objects with non-identical properties.

Provide a way to request the use of strict comparison (===) when using
`array_unique`.  SORT_STRICT does no type conversion: values of
different types are ordered by their type.  For array/object values
which are == but not === we recursively compare the
elements/properties.  Arrays with identical elements will be identical
themselves; but for objects with identical properties which are not
identical we break the tie using the `spl_object_id()` of the objects.
@cscott cscott force-pushed the array_unique_strict branch from 51ae411 to 4c26ec5 Compare December 23, 2021 16:45
Copy link
Contributor

@TysonAndre TysonAndre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coincidentally, I'd also been working on stable sorting lately and didn't remember your PR, found it again when searching past bugs/proposals.

https://github.com/TysonAndre/pecl-teds/blob/0.5.1/teds.c#L567-L657 had some differences in the approach

  1. I'd agree with earlier comments. If this is only meant to be for array_unique, then it should be rejected for other sort functions, and possibly given a different name (ARRAY_FILTER_STRICT, ARRAY_FILTER_SORT_STRICT_ORDER, or something)

  2. This should handle NAN. Operators involving NAN are always false in C/PHP. This would break the ordering of floats in arrays. (except NAN !== NAN is true)

E.g. if 2.0 < NAN is false, and NAN < 2.0 is false, we may get the ordering 2.0, NAN, 2.0 and have duplicate values of 2.

php > var_export(1.0 <=> NAN);
1
php > var_export(NAN <=> NAN);
1
php > var_export(NAN <=> 1.0);
1
  1. Some object types make no sense to compare by property set, e.g. Closure or GMP have no properties to compare.
php > $x = gmp_init(1);
php > var_export((string)$x);
'1'
php > var_export(get_object_vars($x));
array (
)
php > var_export($x);
GMP::__set_state(array(
))

Maybe changing the second argument of array_unique to int|Closure would help (Closure(int $a, int $b): int for comparator, like usort), so that users could specify custom comparators for objects.

  • Not sure if there'd be enough interest in allowing this, but it seems to have use cases, e.g. for array_unique($items, fn($a, $b) => $a->id <=> $b->id); without creating an extra array of just ids and intersecting keys
  • (this may need extra checks to guard against side effects of functions during sorting and stop/recover if there's an exception?)

EDIT: This was apparently proposed in https://externals.io/message/70060 but not really followed up on.

  1. If this is exposed for *sort to end users, then 1.5 < (int)2 should be true (3/2 < 4/2 would be intuitive), and compare(2, 2.0) != 0 should also be true. See comment below t1 != t2


case IS_OBJECT:
{
// Start with a recursive comparison like for arrays, for consistency.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that the sort order for array_unique is an "unobservable implementation detail" -- I think the actual sort order chosen is still revealed by the order of the keys in the returned array. I have to think about that some more.

It's unobservable.

php > var_export(array_unique([10,1,0,0,2]));
array (
  0 => 10,
  1 => 1,
  2 => 0,
  4 => 2,
)

https://www.php.net/array_unique

Note that keys are preserved. If multiple elements compare equal under the given flags, then the key and value of the first equal element will be retained.

return Z_DVAL_P(z1) > Z_DVAL_P(z2) ? 1 : -1;

case IS_STRING:
return zend_binary_strcmp(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: zend_binary_zval_strcmp(z1, z2) does the same thing and is a bit shorter to read

// (This is deliberately not using the user-defined `compare` handler,
// nor is it using zend_std_compare_objects() because that uses
// zend_compare when examining properties, not a strict comparison.)
zend_object *zobj1 = Z_OBJ_P(z1);
Copy link
Contributor

@TysonAndre TysonAndre Jan 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are planning on proposing a sort function: Should this compare class names as well? E.g. with strcmp before building and comparing object properties.

If not, document why not?

we no longer use class name for objects;

EDIT: document why not, I didn't see the reason why in discussion

// (Values of different types can't be identical.)
int t1 = Z_TYPE_P(z1);
int t2 = Z_TYPE_P(z2);
if ( t1 != t2 ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if ( t1 != t2 ) {
if (t1 != t2) {

int t1 = Z_TYPE_P(z1);
int t2 = Z_TYPE_P(z2);
if ( t1 != t2 ) {
return (t1 > t2 ) ? 1 : -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is meant to be used for a sort exposed to end users, than having 2 < 1.5 (IS_LONG < IS_DOUBLE) seems unintuitive. (it's correct if it's only used for array_unique, though)

E.g. for [4/2, 3/2], that's the int 2 followed by float(1.5), since integer division results in integers when exact and floats otherwise.

https://github.com/TysonAndre/pecl-teds/blob/0.5.1/teds.c#L577-L618 handles some of the edge cases. Those edge cases are https://github.com/TysonAndre/pecl-teds/blob/0.5.1/tests/misc/stable_compare.phpt#L75-L86


case IS_OBJECT:
{
// Start with a recursive comparison like for arrays, for consistency.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another reason I'd personally avoid comparing properties is because recursive comparisons would be a fatal error, and recursive object references are much, much more common than recursive array references. For example, trees (e.g. Abstract Syntax Trees) might have parents reference childs, and childs reference parent

(Or an object and the implementation of an object that handles details of serializing the object to a cache such as memcache)

Comparing two different MyObject instances would result in a fatal error from infinite recursion detection.

php > class MyObject { public $loader; public $other; }
php > class MyLoader { public $object; }
php > $object = new MyObject();
php > $object->loader = new MyLoader();
php > $object->loader->object = $object;
php > var_dump($object);
object(MyObject)#1 (2) {
  ["loader"]=>
  object(MyLoader)#2 (1) {
    ["object"]=>
    *RECURSION*
  }
  ["other"]=>
  NULL
}

@TysonAndre
Copy link
Contributor

Definitely not suggesting switching approach or adding to scope, just making a note of an idea I've had for how php could expose object comparisons safely in arrays of objects - Implementing an extensible Comparator class seems doable but would require much more familiarity with internals to guard against comparison callbacks mutating parent objects/arrays while recursing on zvals, and possibly detecting infinite recursion (adding and decrementing reference counts with GC_ADDREF/GC_DELREF)

  • A Comparator/Collections\Comparator may be of use for implementing things like a extensible SortedSet in userland or internally, to handle arrays of objects and other values.

Related past/current RFCs tried to change objects themselves:

// Extensible internal class - user-defined classes can override compare, compareObjects, compareStrings
// $c = Closure::fromCallable(new ComparatorOrSubclass()); can be used to create closures to pass elsewhere, e.g. array_unique($values, $c);
// Allows tracking state and customizations while comparing.
class Comparator {
    /** Compares two top-level values but not inner values, in an overridable way (e.g. to allow ReverseComparator, custom behavior only for top-level objects, etc.) */
    public function __invoke(mixed $a, mixed $b): int {
    }
    /** Compares two objects. Can be extended to customize for specific use cases. Defaults to .... */
    public function compareObjects(object $a, object $b): int {
    }
    /** Compares two strings. Defaults to same ordering as strcmp. */
    public function compareStrings(string $a, string $b): int {
    }
    // I expect php to remove resources at some point.
}

@github-actions
Copy link

There has not been any recent activity in this PR. It will automatically be closed in 7 days if no further action is taken.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants