Skip to content

Ruby: type-tracking and API edges through simple library callables #10375

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 41 commits into from
Sep 30, 2022

Conversation

asgerf
Copy link
Contributor

@asgerf asgerf commented Sep 12, 2022

This PR ensures that simple library callables whose callee does not depend on API graphs can be seen by type-tracking and API graphs. "Simple" in this case means it can be reduced to a store or load edge at the call site, though in the future we may want to add load-store and basic edges as well.

For example, a summary of form Argument[0].Element[0];ReturnValue will be seen as:

  • a load step in type-tracking, reading the content set corresponding to Element[0], and
  • an edge in the API graph labelled with the content corresponding to Element[0].

Ruby is the first language to use the string-based flow summary approach combined with type-tracking and API graphs, but the two systems hadn't been fully integrated. In particular, Ruby has so far been modelling the standard library using summaries that are invisible to type-tracking and API graphs - this would lead to incompleteness compared to just generating the flow edges directly. I believe this PR bridges the gap, so we can get the best of both worlds while modelling everything with flow summaries.

Evaluation shows 216 new call edges, at a modest performance cost. I've investigated the performance overhead a bit, but couldn't get it any lower than it is now - partly because we are just tracking more flow now.

I'm running an evaluation for Python as well, since some shared code was affected.

@asgerf asgerf added no-change-note-required This PR does not need a change note Ruby labels Sep 12, 2022
Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 46 potential problems in the proposed changes. Check the Files changed tab for more details.

@asgerf asgerf force-pushed the rb/summarize-loads-v2 branch from 9268c4b to e52553b Compare September 12, 2022 10:09
Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 46 potential problems in the proposed changes. Check the Files changed tab for more details.

@asgerf asgerf force-pushed the rb/summarize-loads-v2 branch from e52553b to 275b2a9 Compare September 12, 2022 10:55
@asgerf asgerf force-pushed the rb/summarize-loads-v2 branch from d980c31 to 00bb877 Compare September 13, 2022 09:58
@asgerf asgerf marked this pull request as ready for review September 13, 2022 13:47
@asgerf asgerf requested review from a team as code owners September 13, 2022 13:47
Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great Asger! A few comments.

* Gets a node representing the `content` stored on the base object.
*/
Node getContent(DataFlow::Content content) {
result = this.getASuccessor(Label::content(content))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we use ContentSets in the API graph instead, and then let this predicate use getAReadContent when this and Node are use nodes, and getAStoreContent when they are def nodes?

getAReadContent can sometimes have a huge fan-out, which I would like to not have materialized in the graph itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is done like this exactly to avoid the fan-out from getAReadContent.

Here's a break down of how the graph is generated and what ends up being matched:

Description Edge label Matched by
Use node reading a known element n TKnownElementContent(n) Element[n] and Element[any]
Def node storing a known element n TKnownElementContent(n) Element[n] and Element[any]
Use node reading an unknown element TUnknownElementContent Element[?] and Element[any]
Def node storing an unknown element TUnknownElementContent Element[?] and Element[any]

Element[n] does not match TUnknownElementContent, and the model must thus use Element[?,n] if this behaviour is wanted. This seems consistent with how existing Ruby models are required to mention Element[?] explicitly.

I believe this is the best integration we can hope for without changing how ContentSet works.

I tried storing ContentSet on the edge label but it doesn't really work out nicely. getAReadContent() returns a Content, but the edges are labelled with a ContentSet so we have to map it back to a ContentSet to look up in the edge relation. To ensure an efficient join there we'd end up factoring out a relation indexed on the Content, which basically corresponds the edge relation we're using now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After merging with #10574 I believe the table is updated as follows:

  • the first two rows can also be matched by Element[n!]
  • the last two rows can also be matched by Element[k] for any constant k
Description Edge label Matched by
Use node reading a known element n TKnownElementContent(n) Element[n], Element[n!], and Element[any]
Def node storing a known element n TKnownElementContent(n) Element[n], Element[n!], and Element[any]
Use node reading an unknown element TUnknownElementContent Element[k] for any k, Element[?] and Element[any]
Def node storing an unknown element TUnknownElementContent Element[k] for any k, Element[?] and Element[any]
  • Previously it was the model's responsibility to mention Element[?] if unknown reads/stores should be found.
  • Now models are now responsible for using Element[n!] if this behaviour is not wanted.

Again this seems consistent with how Ruby models work after #10574, so I'm cautiously optimistic that everything just works after this change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@calumgrant calumgrant requested review from tausbn and yoff September 20, 2022 09:08
@asgerf asgerf force-pushed the rb/summarize-loads-v2 branch 3 times, most recently from d5a06ad to 51ceef8 Compare September 22, 2022 12:23
@asgerf
Copy link
Contributor Author

asgerf commented Sep 22, 2022

Thanks for the review so far @hvitved.

I have rebased on top the of recent call graph changes, resolving a conflict with #10531, and moved the change to trackUseNode later in history to make it easier to evaluation in isolation.

The first new commit is Ruby: expand test case to reveal mismatching forward/backward flow.

@asgerf
Copy link
Contributor Author

asgerf commented Sep 22, 2022

New evaluation

@asgerf
Copy link
Contributor Author

asgerf commented Sep 26, 2022

We've got quite a few unresolved threads open. @hvitved could you 'mark as resolved' any conversations you feel have been addressed.

From my perspective the two outstanding items are:

@hvitved
Copy link
Contributor

hvitved commented Sep 26, 2022

@hvitved could you 'mark as resolved' any conversations you feel have been addressed.

Done.

@asgerf asgerf force-pushed the rb/summarize-loads-v2 branch from 0275ad5 to 463c6ca Compare September 27, 2022 11:52
@asgerf
Copy link
Contributor Author

asgerf commented Sep 27, 2022

Force-pushed to remove some revert commits that I had accidentally pushed as part of the investigation in this thread.

@asgerf asgerf force-pushed the rb/summarize-loads-v2 branch from 463c6ca to e56630a Compare September 28, 2022 08:49
@asgerf
Copy link
Contributor Author

asgerf commented Sep 28, 2022

Rebased to fix conflict in DataFlowDispatch.qll: the call to getACallSimple now appears in viableLibraryCallable.

@asgerf
Copy link
Contributor Author

asgerf commented Sep 28, 2022

Merging with #10574 went smoother than expected, and at first glance it seems no real changes are necessary. I take that as an indicator that we're doing it right, namely type-trackers storing a ContentSet and API graph edges labelled with Content.

Running another evaluation, hoping it doesn't prove me wrong 😄

Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two last comments (sorry for not catching earlier).

@asgerf
Copy link
Contributor Author

asgerf commented Sep 29, 2022

The evaluation shows a slightly increased cost for ruby/ruby after merging with #10574. I think part of the reason is that there are about twice as many content sets now, and ruby/ruby has a lot of content sets. I tried restricting the content sets we care about with this commit and although the tuple counts showed a great reduction, it did not seem to help. I'd prefer to move ahead with the PR as it is.

@hvitved
Copy link
Contributor

hvitved commented Sep 29, 2022

I tried restricting the content sets we care about with this commit and although the tuple counts showed a great reduction, it did not seem to help. I'd prefer to move ahead with the PR as it is.

I do like that commit, in particular because TStepSummary is cached. We could even make the two branches even smaller via

private predicate compatibleStoreLoadContents(TypeTrackerContent storeContents, TypeTrackerContent loadContents) {
  basicStoreStep(_, _, storeContents) and
  basicLoadStep(_, _, loadContents) and
  compatibleContents(storeContents, loadContents)
}
...
    StoreStep(TypeTrackerContent content) { compatibleStoreLoadContents(content, _)  } or
    LoadStep(TypeTrackerContent content) { compatibleStoreLoadContents(_, content) } or

@asgerf
Copy link
Contributor Author

asgerf commented Sep 29, 2022

I measured the size of StepSummary with each change on ruby/ruby:

Commit Size
Baseline dc03557 4,111,472
Size restriction from asgerf@1c4c058 4,633
Additional size restriction from @hvitved's comment 4,428

Unfortunately neither version is compatible with TypeTracker.startInContent as a model might start type-tracking in a property that has a load in the database but not a store. The predicate isn't currently used in Ruby, but has uses in JS and Python.

I've pushed another commit that makes the restriction on type-tracker compatible with startInContent and start another evaluation (not that I expect it look better than last time).

Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for you hard work, and patience, on this one!

@asgerf
Copy link
Contributor Author

asgerf commented Sep 30, 2022

Evaluation looks similar to the last one.

The last Python evaluation shows a median slow-down of 1% which is a bummer given that this PR does nothing for Python. We can't do langauge-dependent join order tweaks in shared code, and the best join order is slightly different since Ruby has ContentSet and Python doesn't.

@asgerf
Copy link
Contributor Author

asgerf commented Sep 30, 2022

Another evaluation of Python including the latest changes shows the same result, and I got a 👍 from the Python team that we're ok with this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-change-note-required This PR does not need a change note Python Ruby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants