-
Notifications
You must be signed in to change notification settings - Fork 1.7k
C++: HashCons library #107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend checking that every expression gets exactly one HC
value. Just run a query like this on a few large databases:
If you already did that then LGTM.
There's a few categories of expressions that aren't covered yet: array literals will need a similar technique to function calls, struct literals may need something more complicated. New and delete can likely be done as an extension of the function call handling. Throw expressions should be simple to implement, as should sizeof, alignof, and company. I haven't decided whether assignments should be hash-consed or not; input would be appreciated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat library. I believe there's demand for it, so let's try to get it in shape for merging before 1.18.
result = | ||
min(Expr e | ||
| this = hashCons(e) | ||
| e order by e.getLocation().toString()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like have a comment and a test for what happens when multiple expressions have the same location because they come from a macro expansion. I'm not even sure whether the extractor produces one or two Location
objects in that case. Example:
#define SQUARE(x) ((x) * (x))
...
z = SQUARE(y+1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The results there are very strange; it gives one location for the literal and two for the variable access and the addition. I think that's extractor weirdness, but it might be something in the library
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be clear, that seems to be one location for each variable access in the macro expansion, with loc1 != loc2
, but loc1.subsumes(loc2)
and loc2.subsumes(loc1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a test; I'm still not sure what's going on here. @nickrolfe @ian-semmle can you comment on what's happening in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does look like the accesses to y
are duplicated, but somehow there's only one row in the .expected
file for the HashCons
object even though I'd expect it to have two locations. So it might be strange, but I suppose it's not a problem.
} | ||
|
||
private predicate analyzableNonmemberFunctionCall( | ||
FunctionCall fc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many functions in this file have unnecessary newlines in their parameter lists. Our QL style guide currently says a line can be up to 100 columns wide, and I personally think that's what we'll stick with in the future. But even with an 80-column limit, this parameter list can fit one a line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
* expression with this `HC` and using its `toString` and `getLocation` | ||
* methods. | ||
*/ | ||
class HC extends HCBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the name HC
is far too short for a non-private class name, especially one that'll used infrequently. Why not call it HashCons
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that was for symmetry with GVN from the global value numbering library. I've switched to HashCons for the public class name
* methods. | ||
*/ | ||
class HC extends HCBase { | ||
HC() { this instanceof HCBase } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This charpred can just be deleted. It only repeats what extends
already says.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
private predicate mk_PointerFieldAccess( | ||
HC qualifier, Field target, | ||
PointerFieldAccess access) { | ||
analyzablePointerFieldAccess(access) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This indentation is off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
or | ||
HC_ThisExpr(Function fcn) { | ||
mk_ThisExpr(fcn,_) or | ||
mk_ImplicitThisFieldAccess(fcn,_,_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This library goes to great lengths to equate (*obj).field
with obj->field
. Instead of just having one case for field access, there are six predicates named like mk_*FieldAccess*
. The interaction between those six predicates and these three constructors is non-obvious and scarcely documented.
I don't understand why this one case of desugaring should receive special handling in this library. There is no attempt to equate a + 1
with 1 + a
, for example, or to equate *(a + i)
with a[i]
. Given that we have GlobalValueNumbering
for the more semantic applications, shouldn't we leave HashCons
to deal purely with surface syntax? That would simplify this code and make it easier to explain what this library does and does not do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's a holdover from the GVN library. @kevinbackhouse can you confirm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like implicit this
is usually expanded in the extractor, so it may be better to leave that in place and add some notes about it, rather than making the treatment of an implicit this
be inconsistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's rare to see an implicit this
in the AST. I think it happens only for calls to the destructors of fields in compiler-generated destructors. I don't think that's something we need to worry about for HashCons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think it's worth simplifying the hash-consing of field accesses. I don't think it's important that myField
gets equated with this.myField
as those two would never occur in the same function because the extractor inserts explicit this.
in user-written code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, if it's that rare I'll take the logic out;.
} | ||
|
||
/** Gets the hash-cons of expression `e`. */ | ||
cached HC hashCons(Expr e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check whether this predicate ends up in the same cached stage as the HC
type. If not, we'll effectively compute this library twice. The cache layers can be found in the QL4E console log: search for "RESULTS IN" and see what's listed for each stage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're in the same stage.
| mk_StringLiteral(val, t, e) and | ||
result = HC_StringLiteral(val, t)) | ||
or | ||
// Variable with no SSA information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment can be deleted. It's probably a copy-paste error from GVN.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
/** | ||
* Holds if `fc` is a call to `fcn`, `fc`'s first `i-1` arguments have hash-cons | ||
* `list`, and `fc`'s `i`th argument has hash-cons `hc` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment is off by one. Where it says i-1
it should just say i
. Take, for example, the case of i=0
. Then it should say that "fc
's first 0 arguments have hash-cons list
, and fc
's 0th argument has hash-cons hc
.
Maybe also change "fc
's i
th argument" to "fc
's argument at index i
" or clarify in some other way that the first argument has i=0
. Also add a full stop at the end of the sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
HC_Unanalyzable(Expr e) { not analyzableExpr(e,_) } | ||
|
||
/** Used to implement hash-consing of argument lists */ | ||
private cached newtype HC_Args = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive-by comment: When I looked into implementing a similar library for JavaScript, I got a considerable speed-up from adding special-case constructors for argument lists of length one and two. Have you investigated whether this makes a difference for C++?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't. I'll take a look at that, though.
3c69e1d
to
7b494c0
Compare
forall(int i | exists(fc.getArgument(i)) | strictcount(fc.getArgument(i).getFullyConverted()) = 1) and | ||
forall(int i | | ||
exists(fc.getArgument(i)) | | ||
strictcount(fc.getArgument(i).getFullyConverted()) = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exists
has misleading indentation.
result = | ||
min(Expr e | ||
| this = hashCons(e) | ||
| e order by e.getLocation().toString()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does look like the accesses to y
are duplicated, but somehow there's only one row in the .expected
file for the HashCons
object even though I'd expect it to have two locations. So it might be strange, but I suppose it's not a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please import this new library from https://github.com/Semmle/ql/blob/master/cpp/ql/src/filters/ImportAdditionalLibraries.ql so it'll be available in the LGTM query console.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was a lot of changes. I could have easily missed something in this review, so please read the code thoroughly yourself as well.
Given the extent and complexity of this library, you'll also need to add a sanity test in the style of https://github.com/Semmle/ql/blob/master/cpp/ql/test/library-tests/valuenumbering/GlobalValueNumbering/Uniqueness.ql and run it on some large snapshots. That will also give you an opportunity to re-check performance after this last round of updates.
mk_NewExpr(t, alloc, init, align, _, _) | ||
} or | ||
HC_NewArrayExpr(Type t, HC_Alloc alloc, HC_Init init, HC_Align align) { | ||
mk_NewArrayExpr(t, alloc, init, align, _, _) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this also include the number of elements allocated (NewArrayExpr.getExtent
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no; the type of the newArrayExpr is an ArrayType, so the size is included there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can its size be included there if it's not constant but an arbitrary expression? I can see you have a test for it (new int[x]
!= new int[z]
), and that it appears to work, but how?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, now I see why it's working. You added a commit to fix it.
not exists(new.getInitializer()) | ||
or | ||
strictcount(new.getInitializer()) = 1 | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can save 7 lines here by writing either count(new.getAllocatorCall()) <= 1
or not strictcount(new.getAllocatorCall()) > 1
and the same for getInitializer
. I think the first version is clearer, and it should desugar to almost exactly the disjunction you've already written.
|
||
private predicate analyzableNewArrayExpr(NewArrayExpr new) { | ||
strictcount(new.getAllocatedType().getUnspecifiedType()) = 1 and | ||
strictcount(new.getAllocatedType().getUnspecifiedType()) = 1 and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two lines look identical.
not exists(new.getInitializer()) | ||
or | ||
strictcount(new.getInitializer().getFullyConverted()) = 1 | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same suggested as in analyzableNewExpr
.
fc.getNumberOfArguments() = 2 | ||
or | ||
aligned = false and | ||
fc.getNumberOfArguments() = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic with how the argument count differs depending on the allocation being aligned is repeated four times in this file. I think I've also seen it in the IR construction. Can you find a way to move into the AST classes so users of those classes don't have to know about these magic numbers?
} | ||
|
||
private predicate mk_HasAlign(HashCons hc, NewOrNewArrayExpr new) { | ||
hc = hashCons(new.getAlignmentArgument()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this have .getFullyConverted()
on it? Also in the two other places where getAlignmentArgument
is called.
} | ||
|
||
private predicate analyzableNewExpr(NewExpr new) { | ||
strictcount(new.getAllocatedType()) = 1 and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there should also be a check for the count of getAlignmentArgument.getFullyConverted()
.
private newtype HC_Array = | ||
HC_EmptyArray(Type t) { | ||
exists(ArrayAggregateLiteral aal | | ||
aal.getType() = t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only call to getType
without a call to getUnspecifiedType
on it. Is that on purpose?
* Used to implement hash-consing of struct initizializers. | ||
*/ | ||
private newtype HC_Fields = | ||
HC_EmptyFields(Class c) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This library has now grown several similar "list of HashCons
" types: HC_Args
, HC_Fields
, HC_Array
, etc. I wonder if they could all be replaced with one type, similar to HC_Args
but with the "cons" constructor body being a disjunction of all the relevant mk_
predicates. I'll leave it up to you whether that would be clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HC_Fields and HC_Array may need to be separate to handle designated initializers, depending on what decision we make about them
} | ||
|
||
private predicate analyzableTypeidType(TypeidOperator e) { | ||
strictcount(e.getAChild()) = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comparison is always false as strictcount
never returns 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did Uniqueness.ql
and performance work out on large snapshots?
Remember to import this new library from https://github.com/Semmle/ql/blob/master/cpp/ql/src/filters/ImportAdditionalLibraries.ql so it'll be available in the LGTM query console.
Are there any other unaddressed PR comments left?
#116 has been merged now, so I assume you want to sync up with that. |
fce931a
to
cd1403b
Compare
Should this be mentioned in the analysis change notes for 1.18? |
@felicity-semmle Yes, if it makes it into 1.18. |
change-notes/1.18/analysis-cpp.md
Outdated
@@ -33,4 +33,4 @@ | |||
|
|||
## Changes to QL libraries | |||
|
|||
* *Series of bullet points* | |||
* Added a hash consing library for structural comparison of expressions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please refer readers to where they can find the library. Maybe instead write "Added a new library semmle.code.cpp.valuenumbering.HashCons
for structural comparison of expressions."
Also, this file is in conflict.
0c391a5
to
6e2f96b
Compare
I've cleared cache, then run the GVN Uniqueness.ql test to refill the cache, and then run the HashCons Uniqueness.ql to see the actual performance effects. ChakraCore spent 201 seconds in evaluation, lepton spent 58. I'll make a few more current snapshots to test on as well. |
I've taken this out of the 1.18 milestone as it keeps slipping, and I want to make sure the team focus is on testing and stabilising 1.18 rather than adding this feature. I still hope we can merge this soon, but testing 1.18 should take priority over this for both me and @rdmarsh2. Please rebase to |
This makes two changes to how example exprs are selected. Example exprs are now ordered separately by each piece of the location, rather than by stringifying their location. Second, UnknownLocations are now ordered after locations with absolute paths, by using "~" in the lexicographic comparison of absolute paths. I think this works on both POSIX and Windows systems, but it's possible I'm missing a way to start an absolute path with a unicode character.
1e3bf85
to
0e44bf3
Compare
Moved change notes and added the LGTM import. I believe I've addressed all the review comments. |
Add `--working-dir=.` to `index-files` call
Remove external property related log messages
Add ql/missing-qldoc query.
Add ql/missing-qldoc query.
query: split if expression is always true query
A structural comparison library based on the hashconsing strategy for comparison. The implementation is similar to the existing global value numbering library, plus some tricky handling for function arguments.