Skip to content

C++: HashCons library #107

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 32 commits into from
Sep 18, 2018
Merged

C++: HashCons library #107

merged 32 commits into from
Sep 18, 2018

Conversation

rdmarsh2
Copy link
Contributor

A structural comparison library based on the hashconsing strategy for comparison. The implementation is similar to the existing global value numbering library, plus some tricky handling for function arguments.

@ghost
Copy link

ghost commented Aug 25, 2018

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@kevinbackhouse kevinbackhouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend checking that every expression gets exactly one HC value. Just run a query like this on a few large databases:

https://github.com/Semmle/ql/blob/master/cpp/ql/test/library-tests/valuenumbering/GlobalValueNumbering/Uniqueness.ql

If you already did that then LGTM.

@rdmarsh2
Copy link
Contributor Author

There's a few categories of expressions that aren't covered yet: array literals will need a similar technique to function calls, struct literals may need something more complicated. New and delete can likely be done as an extension of the function call handling. Throw expressions should be simple to implement, as should sizeof, alignof, and company. I haven't decided whether assignments should be hash-consed or not; input would be appreciated.

@rdmarsh2 rdmarsh2 added the WIP This is a work-in-progress, do not merge yet! label Aug 25, 2018
Copy link
Contributor

@jbj jbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat library. I believe there's demand for it, so let's try to get it in shape for merging before 1.18.

result =
min(Expr e
| this = hashCons(e)
| e order by e.getLocation().toString())
Copy link
Contributor

@jbj jbj Aug 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like have a comment and a test for what happens when multiple expressions have the same location because they come from a macro expansion. I'm not even sure whether the extractor produces one or two Location objects in that case. Example:

#define SQUARE(x)  ((x) * (x))
...
z = SQUARE(y+1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The results there are very strange; it gives one location for the literal and two for the variable access and the addition. I think that's extractor weirdness, but it might be something in the library

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be clear, that seems to be one location for each variable access in the macro expansion, with loc1 != loc2, but loc1.subsumes(loc2) and loc2.subsumes(loc1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a test; I'm still not sure what's going on here. @nickrolfe @ian-semmle can you comment on what's happening in this case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does look like the accesses to y are duplicated, but somehow there's only one row in the .expected file for the HashCons object even though I'd expect it to have two locations. So it might be strange, but I suppose it's not a problem.

}

private predicate analyzableNonmemberFunctionCall(
FunctionCall fc) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many functions in this file have unnecessary newlines in their parameter lists. Our QL style guide currently says a line can be up to 100 columns wide, and I personally think that's what we'll stick with in the future. But even with an 80-column limit, this parameter list can fit one a line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

* expression with this `HC` and using its `toString` and `getLocation`
* methods.
*/
class HC extends HCBase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name HC is far too short for a non-private class name, especially one that'll used infrequently. Why not call it HashCons?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that was for symmetry with GVN from the global value numbering library. I've switched to HashCons for the public class name

* methods.
*/
class HC extends HCBase {
HC() { this instanceof HCBase }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This charpred can just be deleted. It only repeats what extends already says.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

private predicate mk_PointerFieldAccess(
HC qualifier, Field target,
PointerFieldAccess access) {
analyzablePointerFieldAccess(access) and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This indentation is off.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

or
HC_ThisExpr(Function fcn) {
mk_ThisExpr(fcn,_) or
mk_ImplicitThisFieldAccess(fcn,_,_)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This library goes to great lengths to equate (*obj).field with obj->field. Instead of just having one case for field access, there are six predicates named like mk_*FieldAccess*. The interaction between those six predicates and these three constructors is non-obvious and scarcely documented.

I don't understand why this one case of desugaring should receive special handling in this library. There is no attempt to equate a + 1 with 1 + a, for example, or to equate *(a + i) with a[i]. Given that we have GlobalValueNumbering for the more semantic applications, shouldn't we leave HashCons to deal purely with surface syntax? That would simplify this code and make it easier to explain what this library does and does not do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a holdover from the GVN library. @kevinbackhouse can you confirm?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like implicit this is usually expanded in the extractor, so it may be better to leave that in place and add some notes about it, rather than making the treatment of an implicit this be inconsistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's rare to see an implicit this in the AST. I think it happens only for calls to the destructors of fields in compiler-generated destructors. I don't think that's something we need to worry about for HashCons.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it's worth simplifying the hash-consing of field accesses. I don't think it's important that myField gets equated with this.myField as those two would never occur in the same function because the extractor inserts explicit this. in user-written code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, if it's that rare I'll take the logic out;.

}

/** Gets the hash-cons of expression `e`. */
cached HC hashCons(Expr e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check whether this predicate ends up in the same cached stage as the HC type. If not, we'll effectively compute this library twice. The cache layers can be found in the QL4E console log: search for "RESULTS IN" and see what's listed for each stage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're in the same stage.

| mk_StringLiteral(val, t, e) and
result = HC_StringLiteral(val, t))
or
// Variable with no SSA information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment can be deleted. It's probably a copy-paste error from GVN.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


/**
* Holds if `fc` is a call to `fcn`, `fc`'s first `i-1` arguments have hash-cons
* `list`, and `fc`'s `i`th argument has hash-cons `hc`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is off by one. Where it says i-1 it should just say i. Take, for example, the case of i=0. Then it should say that "fc's first 0 arguments have hash-cons list, and fc's 0th argument has hash-cons hc.

Maybe also change "fc's ith argument" to "fc's argument at index i" or clarify in some other way that the first argument has i=0. Also add a full stop at the end of the sentence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

HC_Unanalyzable(Expr e) { not analyzableExpr(e,_) }

/** Used to implement hash-consing of argument lists */
private cached newtype HC_Args =

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive-by comment: When I looked into implementing a similar library for JavaScript, I got a considerable speed-up from adding special-case constructors for argument lists of length one and two. Have you investigated whether this makes a difference for C++?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't. I'll take a look at that, though.

@rdmarsh2 rdmarsh2 force-pushed the rdmarsh/cpp/HashCons branch from 3c69e1d to 7b494c0 Compare August 27, 2018 20:54
forall(int i | exists(fc.getArgument(i)) | strictcount(fc.getArgument(i).getFullyConverted()) = 1) and
forall(int i |
exists(fc.getArgument(i)) |
strictcount(fc.getArgument(i).getFullyConverted()) = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exists has misleading indentation.

result =
min(Expr e
| this = hashCons(e)
| e order by e.getLocation().toString())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does look like the accesses to y are duplicated, but somehow there's only one row in the .expected file for the HashCons object even though I'd expect it to have two locations. So it might be strange, but I suppose it's not a problem.

Copy link
Contributor

@jbj jbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please import this new library from https://github.com/Semmle/ql/blob/master/cpp/ql/src/filters/ImportAdditionalLibraries.ql so it'll be available in the LGTM query console.

@rdmarsh2 rdmarsh2 added this to the 1.18 milestone Aug 29, 2018
Copy link
Contributor

@jbj jbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a lot of changes. I could have easily missed something in this review, so please read the code thoroughly yourself as well.

Given the extent and complexity of this library, you'll also need to add a sanity test in the style of https://github.com/Semmle/ql/blob/master/cpp/ql/test/library-tests/valuenumbering/GlobalValueNumbering/Uniqueness.ql and run it on some large snapshots. That will also give you an opportunity to re-check performance after this last round of updates.

mk_NewExpr(t, alloc, init, align, _, _)
} or
HC_NewArrayExpr(Type t, HC_Alloc alloc, HC_Init init, HC_Align align) {
mk_NewArrayExpr(t, alloc, init, align, _, _)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this also include the number of elements allocated (NewArrayExpr.getExtent)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no; the type of the newArrayExpr is an ArrayType, so the size is included there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can its size be included there if it's not constant but an arbitrary expression? I can see you have a test for it (new int[x] != new int[z]), and that it appears to work, but how?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, now I see why it's working. You added a commit to fix it.

not exists(new.getInitializer())
or
strictcount(new.getInitializer()) = 1
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can save 7 lines here by writing either count(new.getAllocatorCall()) <= 1 or not strictcount(new.getAllocatorCall()) > 1 and the same for getInitializer. I think the first version is clearer, and it should desugar to almost exactly the disjunction you've already written.


private predicate analyzableNewArrayExpr(NewArrayExpr new) {
strictcount(new.getAllocatedType().getUnspecifiedType()) = 1 and
strictcount(new.getAllocatedType().getUnspecifiedType()) = 1 and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines look identical.

not exists(new.getInitializer())
or
strictcount(new.getInitializer().getFullyConverted()) = 1
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same suggested as in analyzableNewExpr.

fc.getNumberOfArguments() = 2
or
aligned = false and
fc.getNumberOfArguments() = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic with how the argument count differs depending on the allocation being aligned is repeated four times in this file. I think I've also seen it in the IR construction. Can you find a way to move into the AST classes so users of those classes don't have to know about these magic numbers?

}

private predicate mk_HasAlign(HashCons hc, NewOrNewArrayExpr new) {
hc = hashCons(new.getAlignmentArgument())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this have .getFullyConverted() on it? Also in the two other places where getAlignmentArgument is called.

}

private predicate analyzableNewExpr(NewExpr new) {
strictcount(new.getAllocatedType()) = 1 and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should also be a check for the count of getAlignmentArgument.getFullyConverted().

private newtype HC_Array =
HC_EmptyArray(Type t) {
exists(ArrayAggregateLiteral aal |
aal.getType() = t
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only call to getType without a call to getUnspecifiedType on it. Is that on purpose?

* Used to implement hash-consing of struct initizializers.
*/
private newtype HC_Fields =
HC_EmptyFields(Class c) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This library has now grown several similar "list of HashCons" types: HC_Args, HC_Fields, HC_Array, etc. I wonder if they could all be replaced with one type, similar to HC_Args but with the "cons" constructor body being a disjunction of all the relevant mk_ predicates. I'll leave it up to you whether that would be clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HC_Fields and HC_Array may need to be separate to handle designated initializers, depending on what decision we make about them

}

private predicate analyzableTypeidType(TypeidOperator e) {
strictcount(e.getAChild()) = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comparison is always false as strictcount never returns 0.

Copy link
Contributor

@jbj jbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did Uniqueness.ql and performance work out on large snapshots?

Remember to import this new library from https://github.com/Semmle/ql/blob/master/cpp/ql/src/filters/ImportAdditionalLibraries.ql so it'll be available in the LGTM query console.

Are there any other unaddressed PR comments left?

@jbj
Copy link
Contributor

jbj commented Aug 31, 2018

#116 has been merged now, so I assume you want to sync up with that.

@rdmarsh2 rdmarsh2 changed the base branch from master to rc/1.18 August 31, 2018 16:14
@rdmarsh2 rdmarsh2 force-pushed the rdmarsh/cpp/HashCons branch from fce931a to cd1403b Compare August 31, 2018 16:17
@rdmarsh2 rdmarsh2 removed the WIP This is a work-in-progress, do not merge yet! label Aug 31, 2018
@felicitymay
Copy link
Contributor

Should this be mentioned in the analysis change notes for 1.18?

@jbj
Copy link
Contributor

jbj commented Sep 4, 2018

@felicity-semmle Yes, if it makes it into 1.18.

@@ -33,4 +33,4 @@

## Changes to QL libraries

* *Series of bullet points*
* Added a hash consing library for structural comparison of expressions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refer readers to where they can find the library. Maybe instead write "Added a new library semmle.code.cpp.valuenumbering.HashCons for structural comparison of expressions."

Also, this file is in conflict.

@rdmarsh2
Copy link
Contributor Author

rdmarsh2 commented Sep 6, 2018

I've cleared cache, then run the GVN Uniqueness.ql test to refill the cache, and then run the HashCons Uniqueness.ql to see the actual performance effects. ChakraCore spent 201 seconds in evaluation, lepton spent 58. I'll make a few more current snapshots to test on as well.

@jbj jbj removed this from the 1.18 milestone Sep 7, 2018
@jbj
Copy link
Contributor

jbj commented Sep 7, 2018

I've taken this out of the 1.18 milestone as it keeps slipping, and I want to make sure the team focus is on testing and stabilising 1.18 rather than adding this feature. I still hope we can merge this soon, but testing 1.18 should take priority over this for both me and @rdmarsh2.

Please rebase to master and move the change note into the 1.19 change notes file (create if needed).

@rdmarsh2 rdmarsh2 changed the base branch from rc/1.18 to master September 7, 2018 17:21
@rdmarsh2 rdmarsh2 force-pushed the rdmarsh/cpp/HashCons branch from 1e3bf85 to 0e44bf3 Compare September 10, 2018 19:23
@rdmarsh2
Copy link
Contributor Author

Moved change notes and added the LGTM import. I believe I've addressed all the review comments.

@jbj jbj merged commit 86fe0ce into github:master Sep 18, 2018
aibaars pushed a commit that referenced this pull request Oct 14, 2021
Add `--working-dir=.` to `index-files` call
smowton pushed a commit to smowton/codeql that referenced this pull request Dec 6, 2021
Remove external property related log messages
erik-krogh pushed a commit to erik-krogh/ql that referenced this pull request Dec 15, 2021
erik-krogh pushed a commit to erik-krogh/ql that referenced this pull request Dec 15, 2021
dbartol pushed a commit that referenced this pull request Dec 18, 2024
query: split if expression is always true query
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants