-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Ruby: type-tracking and API edges through simple library callables #10375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found 46 potential problems in the proposed changes. Check the Files changed tab for more details.
9268c4b
to
e52553b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found 46 potential problems in the proposed changes. Check the Files changed tab for more details.
e52553b
to
275b2a9
Compare
python/ql/lib/semmle/python/dataflow/new/internal/TypeTrackerSpecific.qll
Fixed
Show fixed
Hide fixed
d980c31
to
00bb877
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great Asger! A few comments.
* Gets a node representing the `content` stored on the base object. | ||
*/ | ||
Node getContent(DataFlow::Content content) { | ||
result = this.getASuccessor(Label::content(content)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't we use ContentSet
s in the API graph instead, and then let this predicate use getAReadContent
when this
and Node
are use nodes, and getAStoreContent
when they are def nodes?
getAReadContent
can sometimes have a huge fan-out, which I would like to not have materialized in the graph itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is done like this exactly to avoid the fan-out from getAReadContent
.
Here's a break down of how the graph is generated and what ends up being matched:
Description | Edge label | Matched by |
---|---|---|
Use node reading a known element n |
TKnownElementContent(n) |
Element[n] and Element[any] |
Def node storing a known element n |
TKnownElementContent(n) |
Element[n] and Element[any] |
Use node reading an unknown element | TUnknownElementContent |
Element[?] and Element[any] |
Def node storing an unknown element | TUnknownElementContent |
Element[?] and Element[any] |
Element[n]
does not match TUnknownElementContent
, and the model must thus use Element[?,n]
if this behaviour is wanted. This seems consistent with how existing Ruby models are required to mention Element[?]
explicitly.
I believe this is the best integration we can hope for without changing how ContentSet
works.
I tried storing ContentSet
on the edge label but it doesn't really work out nicely. getAReadContent()
returns a Content
, but the edges are labelled with a ContentSet
so we have to map it back to a ContentSet
to look up in the edge
relation. To ensure an efficient join there we'd end up factoring out a relation indexed on the Content
, which basically corresponds the edge relation we're using now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After merging with #10574 I believe the table is updated as follows:
- the first two rows can also be matched by
Element[n!]
- the last two rows can also be matched by
Element[k]
for any constantk
Description | Edge label | Matched by |
---|---|---|
Use node reading a known element n |
TKnownElementContent(n) |
Element[n] , Element[n!] , and Element[any] |
Def node storing a known element n |
TKnownElementContent(n) |
Element[n] , Element[n!] , and Element[any] |
Use node reading an unknown element | TUnknownElementContent |
Element[k] for any k , Element[?] and Element[any] |
Def node storing an unknown element | TUnknownElementContent |
Element[k] for any k , Element[?] and Element[any] |
- Previously it was the model's responsibility to mention
Element[?]
if unknown reads/stores should be found. - Now models are now responsible for using
Element[n!]
if this behaviour is not wanted.
Again this seems consistent with how Ruby models work after #10574, so I'm cautiously optimistic that everything just works after this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
d5a06ad
to
51ceef8
Compare
Thanks for the review so far @hvitved. I have rebased on top the of recent call graph changes, resolving a conflict with #10531, and moved the change to The first new commit is |
We've got quite a few unresolved threads open. @hvitved could you 'mark as resolved' any conversations you feel have been addressed. From my perspective the two outstanding items are:
|
Done. |
0275ad5
to
463c6ca
Compare
Force-pushed to remove some revert commits that I had accidentally pushed as part of the investigation in this thread. |
fixup qldoc in OptionalTypeTrckerContent
The optimizations done here now seem to backfire and cause more problems than they fix.
463c6ca
to
e56630a
Compare
Rebased to fix conflict in |
This only fixes superficial conflicts with github#10574 semantic conflicts will be addressed in later commits
Merging with #10574 went smoother than expected, and at first glance it seems no real changes are necessary. I take that as an indicator that we're doing it right, namely type-trackers storing a Running another evaluation, hoping it doesn't prove me wrong 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two last comments (sorry for not catching earlier).
The evaluation shows a slightly increased cost for ruby/ruby after merging with #10574. I think part of the reason is that there are about twice as many content sets now, and ruby/ruby has a lot of content sets. I tried restricting the content sets we care about with this commit and although the tuple counts showed a great reduction, it did not seem to help. I'd prefer to move ahead with the PR as it is. |
I do like that commit, in particular because private predicate compatibleStoreLoadContents(TypeTrackerContent storeContents, TypeTrackerContent loadContents) {
basicStoreStep(_, _, storeContents) and
basicLoadStep(_, _, loadContents) and
compatibleContents(storeContents, loadContents)
}
...
StoreStep(TypeTrackerContent content) { compatibleStoreLoadContents(content, _) } or
LoadStep(TypeTrackerContent content) { compatibleStoreLoadContents(_, content) } or |
I measured the size of
Unfortunately neither version is compatible with I've pushed another commit that makes the restriction on type-tracker compatible with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for you hard work, and patience, on this one!
Evaluation looks similar to the last one. The last Python evaluation shows a median slow-down of 1% which is a bummer given that this PR does nothing for Python. We can't do langauge-dependent join order tweaks in shared code, and the best join order is slightly different since Ruby has |
Another evaluation of Python including the latest changes shows the same result, and I got a 👍 from the Python team that we're ok with this. |
This PR ensures that simple library callables whose callee does not depend on API graphs can be seen by type-tracking and API graphs. "Simple" in this case means it can be reduced to a store or load edge at the call site, though in the future we may want to add load-store and basic edges as well.
For example, a summary of form
Argument[0].Element[0];ReturnValue
will be seen as:Element[0]
, andElement[0]
.Ruby is the first language to use the string-based flow summary approach combined with type-tracking and API graphs, but the two systems hadn't been fully integrated. In particular, Ruby has so far been modelling the standard library using summaries that are invisible to type-tracking and API graphs - this would lead to incompleteness compared to just generating the flow edges directly. I believe this PR bridges the gap, so we can get the best of both worlds while modelling everything with flow summaries.
Evaluation shows 216 new call edges, at a modest performance cost. I've investigated the performance overhead a bit, but couldn't get it any lower than it is now - partly because we are just tracking more flow now.
I'm running an evaluation for Python as well, since some shared code was affected.