C++: HashCons library #107

rdmarsh2 · 2018-08-25T00:20:58Z

A structural comparison library based on the hashconsing strategy for comparison. The implementation is similar to the existing global value numbering library, plus some tricky handling for function arguments.

ghost · 2018-08-25T00:21:07Z

All committers have signed the CLA.

kevinbackhouse

I recommend checking that every expression gets exactly one HC value. Just run a query like this on a few large databases:

https://github.com/Semmle/ql/blob/master/cpp/ql/test/library-tests/valuenumbering/GlobalValueNumbering/Uniqueness.ql

If you already did that then LGTM.
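As a rough illustration of what such a uniqueness check looks for, here is a Python sketch (not the linked QL query, which I have not reproduced here). The function name and the `(expr, hc)` pair representation are assumptions made for the example. It reports every expression that has zero or more than one hash-cons value.

```python
def uniqueness_violations(all_exprs, pairs):
    """Return {expr: set_of_values} for every expr without exactly one value.

    all_exprs: iterable of expression identifiers.
    pairs: iterable of (expr, hash_cons_value) rows, as a query might produce.
    """
    values = {e: set() for e in all_exprs}
    for expr, hc in pairs:
        values.setdefault(expr, set()).add(hc)
    return {e: hcs for e, hcs in values.items() if len(hcs) != 1}


exprs = ["e1", "e2", "e3"]
rows = [("e1", "h1"), ("e2", "h1"), ("e2", "h2")]
bad = uniqueness_violations(exprs, rows)
print(bad)  # e2 has two values, e3 has none; e1 is fine
```

An empty result on a large database is the outcome the check is hoping for.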

rdmarsh2 · 2018-08-25T01:12:28Z

There are a few categories of expressions that aren't covered yet: array literals will need a similar technique to function calls, and struct literals may need something more complicated. New and delete can likely be done as an extension of the function call handling. Throw expressions should be simple to implement, as should sizeof, alignof, and company. I haven't decided whether assignments should be hash-consed or not; input would be appreciated.

jbj

Neat library. I believe there's demand for it, so let's try to get it in shape for merging before 1.18.

jbj · 2018-08-27T09:53:46Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+    result =
+       min(Expr e
+       | this = hashCons(e)
+       | e order by e.getLocation().toString())


I'd like to have a comment and a test for what happens when multiple expressions have the same location because they come from a macro expansion. I'm not even sure whether the extractor produces one or two Location objects in that case. Example:

#define SQUARE(x) ((x) * (x)) ... z = SQUARE(y+1)

The results there are very strange; it gives one location for the literal and two for the variable access and the addition. I think that's extractor weirdness, but it might be something in the library.

To be clear, that seems to be one location for each variable access in the macro expansion, with loc1 != loc2, but loc1.subsumes(loc2) and loc2.subsumes(loc1).

Added a test; I'm still not sure what's going on here. @nickrolfe @ian-semmle can you comment on what's happening in this case?

It does look like the accesses to y are duplicated, but somehow there's only one row in the .expected file for the HashCons object even though I'd expect it to have two locations. So it might be strange, but I suppose it's not a problem.

jbj · 2018-08-27T09:58:19Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+}
+
+private predicate analyzableNonmemberFunctionCall(
+  FunctionCall fc) {


Many functions in this file have unnecessary newlines in their parameter lists. Our QL style guide currently says a line can be up to 100 columns wide, and I personally think that's what we'll stick with in the future. But even with an 80-column limit, this parameter list can fit on one line.

jbj · 2018-08-27T10:02:37Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+ * expression with this `HC` and using its `toString` and `getLocation`
+ * methods.
+ */
+class HC extends HCBase {


I think the name HC is far too short for a non-private class name, especially one that'll be used infrequently. Why not call it HashCons?

I think that was for symmetry with GVN from the global value numbering library. I've switched to HashCons for the public class name.

jbj · 2018-08-27T10:06:51Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+ * methods.
+ */
+class HC extends HCBase {
+  HC() { this instanceof HCBase }


This charpred can just be deleted. It only repeats what extends already says.

jbj · 2018-08-27T11:00:29Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+private predicate mk_PointerFieldAccess(
+  HC qualifier, Field target,
+  PointerFieldAccess access) {
+   analyzablePointerFieldAccess(access) and


This indentation is off.

jbj · 2018-08-27T11:39:57Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+  or
+  HC_ThisExpr(Function fcn) {
+    mk_ThisExpr(fcn,_) or
+    mk_ImplicitThisFieldAccess(fcn,_,_)


This library goes to great lengths to equate (*obj).field with obj->field. Instead of just having one case for field access, there are six predicates named like mk_*FieldAccess*. The interaction between those six predicates and these three constructors is non-obvious and scarcely documented.

I don't understand why this one case of desugaring should receive special handling in this library. There is no attempt to equate a + 1 with 1 + a, for example, or to equate *(a + i) with a[i]. Given that we have GlobalValueNumbering for the more semantic applications, shouldn't we leave HashCons to deal purely with surface syntax? That would simplify this code and make it easier to explain what this library does and does not do.

I think that's a holdover from the GVN library. @kevinbackhouse can you confirm?

Yes, that's correct.

It looks like implicit this is usually expanded in the extractor, so it may be better to leave that in place and add some notes about it, rather than making the treatment of an implicit this be inconsistent.

It's rare to see an implicit this in the AST. I think it happens only for calls to the destructors of fields in compiler-generated destructors. I don't think that's something we need to worry about for HashCons.

I still think it's worth simplifying the hash-consing of field accesses. I don't think it's important that myField gets equated with this.myField as those two would never occur in the same function because the extractor inserts explicit this. in user-written code.

Yeah, if it's that rare I'll take the logic out.

jbj · 2018-08-27T11:44:54Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+}
+
+/** Gets the hash-cons of expression `e`. */
+cached HC hashCons(Expr e) {


Please check whether this predicate ends up in the same cached stage as the HC type. If not, we'll effectively compute this library twice. The cache layers can be found in the QL4E console log: search for "RESULTS IN" and see what's listed for each stage.

They're in the same stage.

jbj · 2018-08-27T11:45:14Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+  | mk_StringLiteral(val, t, e) and
+    result = HC_StringLiteral(val, t))
+  or
+  // Variable with no SSA information.


This comment can be deleted. It's probably a copy-paste error from GVN.

jbj · 2018-08-27T12:02:02Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+
+/**
+ * Holds if `fc` is a call to `fcn`, `fc`'s first `i-1` arguments have hash-cons
+ * `list`, and `fc`'s `i`th argument has hash-cons `hc`


I think this comment is off by one. Where it says i-1 it should just say i. Take, for example, the case of i=0. Then it should say that "fc's first 0 arguments have hash-cons list, and fc's 0th argument has hash-cons hc".

Maybe also change "fc's ith argument" to "fc's argument at index i" or clarify in some other way that the first argument has i=0. Also add a full stop at the end of the sentence.
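To make the zero-based reading concrete, here is a small Python sketch of building an argument list one element at a time. It is an illustration only; `EMPTY`, `cons`, and `arg_list_steps` are hypothetical names and do not mirror the QL predicates in the PR. At each index `i`, the "list so far" covers exactly the first `i` arguments, so at `i = 0` it is the empty list.

```python
EMPTY = ("nil",)


def cons(prefix, hc):
    """Extend a list of hash-conses with one more element."""
    return ("cons", prefix, hc)


def arg_list_steps(arg_hcs):
    """Yield (i, list_of_first_i_args, hc_of_arg_at_index_i) for each argument."""
    prefix = EMPTY
    for i, hc in enumerate(arg_hcs):
        yield i, prefix, hc
        prefix = cons(prefix, hc)


steps = list(arg_list_steps(["A", "B"]))
for i, prefix, hc in steps:
    print(i, prefix, hc)
```

With two arguments, the step for index 0 pairs the empty list with `A`, and the step for index 1 pairs the one-element list with `B`.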

xiemaisi · 2018-08-27T12:38:59Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+  HC_Unanalyzable(Expr e) { not analyzableExpr(e,_) }
+
+/** Used to implement hash-consing of argument lists  */
+private cached newtype HC_Args =


Drive-by comment: When I looked into implementing a similar library for JavaScript, I got a considerable speed-up from adding special-case constructors for argument lists of length one and two. Have you investigated whether this makes a difference for C++?

I haven't. I'll take a look at that, though.

jbj · 2018-08-28T06:43:39Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

-  forall(int i | exists(fc.getArgument(i)) | strictcount(fc.getArgument(i).getFullyConverted()) = 1) and
+  forall(int i |
+  exists(fc.getArgument(i)) |
+    strictcount(fc.getArgument(i).getFullyConverted()) = 1


The exists has misleading indentation.

jbj · 2018-08-28T07:01:52Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+    result =
+       min(Expr e
+       | this = hashCons(e)
+       | e order by e.getLocation().toString())


It does look like the accesses to y are duplicated, but somehow there's only one row in the .expected file for the HashCons object even though I'd expect it to have two locations. So it might be strange, but I suppose it's not a problem.

jbj

Please import this new library from https://github.com/Semmle/ql/blob/master/cpp/ql/src/filters/ImportAdditionalLibraries.ql so it'll be available in the LGTM query console.

jbj

That was a lot of changes. I could have easily missed something in this review, so please read the code thoroughly yourself as well.

Given the extent and complexity of this library, you'll also need to add a sanity test in the style of https://github.com/Semmle/ql/blob/master/cpp/ql/test/library-tests/valuenumbering/GlobalValueNumbering/Uniqueness.ql and run it on some large snapshots. That will also give you an opportunity to re-check performance after this last round of updates.

jbj · 2018-08-30T07:59:08Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+    mk_NewExpr(t, alloc, init, align, _, _)
+  }  or
+  HC_NewArrayExpr(Type t, HC_Alloc alloc, HC_Init init, HC_Align align) {
+    mk_NewArrayExpr(t, alloc, init, align, _, _)


Shouldn't this also include the number of elements allocated (NewArrayExpr.getExtent)?

No; the type of the NewArrayExpr is an ArrayType, so the size is included there.

How can its size be included there if it's not constant but an arbitrary expression? I can see you have a test for it (new int[x] != new int[z]), and that it appears to work, but how?

Oh, now I see why it's working. You added a commit to fix it.

jbj · 2018-08-30T08:05:31Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+    not exists(new.getInitializer())
+    or
+    strictcount(new.getInitializer()) = 1
+  )


I think you can save 7 lines here by writing either count(new.getAllocatorCall()) <= 1 or not strictcount(new.getAllocatorCall()) > 1 and the same for getInitializer. I think the first version is clearer, and it should desugar to almost exactly the disjunction you've already written.
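The equivalence can be checked with a tiny Python model of the two aggregates. This is a sketch under the assumption that `count` yields 0 for an empty set while `strictcount` has no result at all in that case (modelled here as `None`); the function names are made up for the example.

```python
def count(values):
    """Model of QL count: number of distinct values, 0 if there are none."""
    return len(set(values))


def strictcount(values):
    """Model of QL strictcount: like count, but no result (None) when empty."""
    n = len(set(values))
    return n if n > 0 else None


def original_form(values):
    # not exists(x) or strictcount(x) = 1
    return count(values) == 0 or strictcount(values) == 1


def suggested_form(values):
    # count(x) <= 1
    return count(values) <= 1


for vs in ([], ["a"], ["a", "b"]):
    print(vs, original_form(vs), suggested_form(vs))
```

In this model a comparison such as `strictcount(values) == 0` can never hold, because the empty case produces no value to compare.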

jbj · 2018-08-30T08:08:17Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+
+private predicate analyzableNewArrayExpr(NewArrayExpr new) {
+  strictcount(new.getAllocatedType().getUnspecifiedType()) = 1 and
+  strictcount(new.getAllocatedType().getUnspecifiedType()) = 1 and


These two lines look identical.

jbj · 2018-08-30T08:08:41Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+    not exists(new.getInitializer())
+    or
+    strictcount(new.getInitializer().getFullyConverted()) = 1
+  )


Same suggestion as in analyzableNewExpr.

jbj · 2018-08-30T08:11:45Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+        fc.getNumberOfArguments() = 2
+        or
+        aligned = false and
+        fc.getNumberOfArguments() = 1


This logic with how the argument count differs depending on the allocation being aligned is repeated four times in this file. I think I've also seen it in the IR construction. Can you find a way to move it into the AST classes so users of those classes don't have to know about these magic numbers?

jbj · 2018-08-30T08:33:28Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+}
+
+private predicate mk_HasAlign(HashCons hc, NewOrNewArrayExpr new) {
+  hc = hashCons(new.getAlignmentArgument())


Should this have .getFullyConverted() on it? Also in the two other places where getAlignmentArgument is called.

jbj · 2018-08-30T08:35:13Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+}
+
+private predicate analyzableNewExpr(NewExpr new) {
+  strictcount(new.getAllocatedType()) = 1 and


I think there should also be a check for the count of getAlignmentArgument.getFullyConverted().

jbj · 2018-08-30T08:39:08Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+private newtype HC_Array =
+  HC_EmptyArray(Type t) {
+    exists(ArrayAggregateLiteral aal |
+      aal.getType() = t


This is the only call to getType without a call to getUnspecifiedType on it. Is that on purpose?

jbj · 2018-08-30T08:50:28Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+ * Used to implement hash-consing of struct initizializers.
+ */
+private newtype HC_Fields =
+  HC_EmptyFields(Class c) {


This library has now grown several similar "list of HashCons" types: HC_Args, HC_Fields, HC_Array, etc. I wonder if they could all be replaced with one type, similar to HC_Args but with the "cons" constructor body being a disjunction of all the relevant mk_ predicates. I'll leave it up to you whether that would be clearer.

HC_Fields and HC_Array may need to be separate to handle designated initializers, depending on what decision we make about them.

jbj · 2018-08-30T09:00:34Z

cpp/ql/src/semmle/code/cpp/valuenumbering/HashCons.qll

+}
+
+private predicate analyzableTypeidType(TypeidOperator e) {
+  strictcount(e.getAChild()) = 0


This comparison is always false as strictcount never returns 0.

jbj

Did Uniqueness.ql and performance work out on large snapshots?

Remember to import this new library from https://github.com/Semmle/ql/blob/master/cpp/ql/src/filters/ImportAdditionalLibraries.ql so it'll be available in the LGTM query console.

Are there any other unaddressed PR comments left?

jbj · 2018-08-31T11:45:30Z

#116 has been merged now, so I assume you want to sync up with that.

felicitymay · 2018-09-04T14:12:39Z

Should this be mentioned in the analysis change notes for 1.18?

jbj · 2018-09-04T14:27:00Z

@felicity-semmle Yes, if it makes it into 1.18.

jbj · 2018-09-05T06:48:07Z

change-notes/1.18/analysis-cpp.md

@@ -33,4 +33,4 @@

 ## Changes to QL libraries

-* *Series of bullet points*
+* Added a hash consing library for structural comparison of expressions.


Please refer readers to where they can find the library. Maybe instead write "Added a new library semmle.code.cpp.valuenumbering.HashCons for structural comparison of expressions."

Also, this file is in conflict.

rdmarsh2 · 2018-09-06T22:24:29Z

I've cleared cache, then run the GVN Uniqueness.ql test to refill the cache, and then run the HashCons Uniqueness.ql to see the actual performance effects. ChakraCore spent 201 seconds in evaluation, lepton spent 58. I'll make a few more current snapshots to test on as well.

jbj · 2018-09-07T09:55:28Z

I've taken this out of the 1.18 milestone as it keeps slipping, and I want to make sure the team focus is on testing and stabilising 1.18 rather than adding this feature. I still hope we can merge this soon, but testing 1.18 should take priority over this for both me and @rdmarsh2.

Please rebase to master and move the change note into the 1.19 change notes file (create if needed).

This makes two changes to how example exprs are selected. First, example exprs are now ordered separately by each piece of the location, rather than by stringifying their location. Second, UnknownLocations are now ordered after locations with absolute paths, by using "~" in the lexicographic comparison of absolute paths. I think this works on both POSIX and Windows systems, but it's possible I'm missing a way to start an absolute path with a Unicode character.
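A short Python sketch of why ordering by location pieces differs from ordering by the stringified location, and how a "~" sentinel pushes unknown locations to the end. The tuple layout `(path, line, column)` and the name `location_key` are assumptions for illustration, not the representation used by the library.

```python
def location_key(loc):
    """Sort key: compare path, then line, then column.

    loc is (path, line, column), or None for an unknown location.
    "~" sorts after "/" and after ASCII letters, so unknown locations
    come after POSIX paths ("/...") and drive-letter paths ("C:\\...").
    """
    if loc is None:
        return ("~", 0, 0)
    return loc


locs = [("/src/a.c", 10, 1), None, ("/src/a.c", 9, 5)]

# Stringified ordering compares character by character, so line 10 sorts
# before line 9 because "1" < "9".
by_string = min("%s:%d:%d" % l for l in locs if l is not None)

# Piecewise ordering compares line numbers as integers.
by_parts = min(locs, key=location_key)

print(by_string)
print(by_parts)
print(sorted(locs, key=location_key)[-1])  # the unknown location sorts last
```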

rdmarsh2 · 2018-09-13T16:59:21Z

Moved change notes and added the LGTM import. I believe I've addressed all the review comments.

kevinbackhouse reviewed Aug 25, 2018

rdmarsh2 added the WIP label Aug 25, 2018

jbj reviewed Aug 27, 2018

xiemaisi reviewed Aug 27, 2018

rdmarsh2 force-pushed the rdmarsh/cpp/HashCons branch from 3c69e1d to 7b494c0 Compare August 27, 2018 20:54

jbj reviewed Aug 28, 2018

rdmarsh2 added this to the 1.18 milestone Aug 29, 2018

jbj requested changes Aug 30, 2018

jbj reviewed Aug 31, 2018

rdmarsh2 changed the base branch from master to rc/1.18 August 31, 2018 16:14

rdmarsh2 force-pushed the rdmarsh/cpp/HashCons branch from fce931a to cd1403b Compare August 31, 2018 16:17

rdmarsh2 removed the WIP label Aug 31, 2018

jbj reviewed Sep 5, 2018

rdmarsh2 force-pushed the rdmarsh/cpp/HashCons branch from 0c391a5 to 6e2f96b Compare September 5, 2018 15:55

felicitymay added C++ documentation labels Sep 5, 2018

jbj removed this from the 1.18 milestone Sep 7, 2018

rdmarsh2 changed the base branch from rc/1.18 to master September 7, 2018 17:21

kevinbackhouse and others added 4 commits September 10, 2018 12:22

C++: initial implementation of a HashCons library.

2d7109b

C++: first tests for HashCons

3c6a9c0

C++: add literal tests

8b8ec7c

C++: rename HashCons test

d8dc75a

rdmarsh2 added 20 commits September 10, 2018 12:22

C++: add support for enums in HashCons

e6314c5

C++: respond to PR comments

fede8d6

C++: HashCons for new, new[], sizeof, alignof

5549b6f

C++: fix handling of aligned allocators

8f446aa

C++: initial support for aggregate initializers

752f39b

C++: add HashCons for delete expressions

85cfb02

C++: HashCons for throw

8189798

C++: Hashcons tests for ArrayExpr

cfeed30

C++: Hashcons for ?:, ExprCall, and weird stuff

06a3e8f

C++: fix performance of argument hash-consing

246ae2d

C++: remove implicit this handling in HashCons

fa9eeea

C++: Simplify some code

9f476e5

C++: Simplify HashCons for new and handle extents

c42ecfe

fix HashCons for typeid of type

2d098fe

C++: typeid and noexcept fixes in HashCons

bbafcd9

C++: accept test output

166dba2

C++: change note for HashCons library

990bfb4

C++: Uniqueness fixes for HashCons

fb8ad93

C++: Add import for LGTM

0e44bf3

rdmarsh2 force-pushed the rdmarsh/cpp/HashCons branch from 1e3bf85 to 0e44bf3 Compare September 10, 2018 19:23

C++: migrate change note

1a14b13

jbj approved these changes Sep 18, 2018

jbj merged commit 86fe0ce into github:master Sep 18, 2018

aibaars pushed a commit that referenced this pull request Oct 14, 2021

Merge pull request #107 from github/hvitved/index-files-working-dir

6423ea3

Add `--working-dir=.` to `index-files` call

smowton pushed a commit to smowton/codeql that referenced this pull request Dec 6, 2021

Merge pull request github#107 from github/kotlin-reduce-extraction-noise

67f632c

Remove external property related log messages

erik-krogh pushed a commit to erik-krogh/ql that referenced this pull request Dec 15, 2021

Merge pull request github#107 from github/missing-qldoc

ee7ac53

Add ql/missing-qldoc query.

erik-krogh pushed a commit to erik-krogh/ql that referenced this pull request Dec 15, 2021

QL: Merge pull request github#107 from github/missing-qldoc

2f77b92

Add ql/missing-qldoc query.

dbartol pushed a commit that referenced this pull request Dec 18, 2024

Merge pull request #107 from github/query_if

55476af

query: split if expression is always true query
