Reduce memory usage in AST parent map generation by lazily checking if nodes have been seen #129934
Conversation
@llvm/pr-subscribers-clang

Author: None (higher-performance)

Changes

This is a regression in #87824. With this change, the use of parent maps (

Fixes #129808

Full diff: https://github.com/llvm/llvm-project/pull/129934.diff

1 Files Affected:
diff --git a/clang/lib/AST/ParentMapContext.cpp b/clang/lib/AST/ParentMapContext.cpp
index e9387ec79c373..03e5236eeb1db 100644
--- a/clang/lib/AST/ParentMapContext.cpp
+++ b/clang/lib/AST/ParentMapContext.cpp
@@ -65,21 +65,78 @@ class ParentMapContext::ParentMap {
public:
ParentVector() = default;
explicit ParentVector(size_t N, const DynTypedNode &Value) {
- Items.reserve(N);
+ SortedAndUnsortedItems.reserve(N);
for (; N > 0; --N)
push_back(Value);
}
bool contains(const DynTypedNode &Value) {
- return Seen.contains(Value);
+ const auto SortBoundary = SortedAndUnsortedItems.begin() + NumSorted;
+ bool Found = std::binary_search(SortedAndUnsortedItems.begin(),
+ SortBoundary, Value);
+ Budget += llvm::bit_width(
+ static_cast<size_t>(SortBoundary - SortedAndUnsortedItems.begin()));
+ if (!Found) {
+ auto FoundIt =
+ std::find(SortBoundary, SortedAndUnsortedItems.end(), Value);
+ Budget += FoundIt - SortBoundary;
+ Found |= FoundIt != SortedAndUnsortedItems.end();
+ }
+ SortIfWorthwhile();
+ return Found;
}
void push_back(const DynTypedNode &Value) {
- if (!Value.getMemoizationData() || Seen.insert(Value).second)
- Items.push_back(Value);
+ ++Budget;
+ if (!Value.getMemoizationData() || !contains(Value)) {
+ SortedAndUnsortedItems.push_back(Value);
+ if (SortedAndUnsortedItems.back() < SortedAndUnsortedItems[NumSorted]) {
+ // Keep the minimum element in the middle to quickly tell us if
+ // merging will be necessary
+ using std::swap;
+ swap(SortedAndUnsortedItems.back(),
+ SortedAndUnsortedItems[NumSorted]);
+ }
+ }
+ VerifyInvariant();
}
- llvm::ArrayRef<DynTypedNode> view() const { return Items; }
+ llvm::ArrayRef<DynTypedNode> view() {
+ ++Budget;
+ return SortedAndUnsortedItems;
+ }
+
private:
- llvm::SmallVector<DynTypedNode, 2> Items;
- llvm::SmallDenseSet<DynTypedNode, 2> Seen;
+ void SortIfWorthwhile() {
+ VerifyInvariant();
+ auto SortBoundary = SortedAndUnsortedItems.begin() + NumSorted;
+ if (SortBoundary != SortedAndUnsortedItems.end()) {
+ const size_t NumUnsorted = SortedAndUnsortedItems.end() - SortBoundary;
+ const size_t SortingCost = NumUnsorted * llvm::bit_width(NumUnsorted);
+ const bool NeedMerge = SortBoundary != SortedAndUnsortedItems.begin();
+ // Assume that the naive implementation would copy these elements.
+ // This is just an estimate; it's OK if it's wrong.
+ const size_t MergeCost = SortedAndUnsortedItems.size() + NumUnsorted;
+ if (Budget >= (NeedMerge ? MergeCost : 0) + SortingCost) {
+ std::sort(SortBoundary, SortedAndUnsortedItems.end());
+ if (NeedMerge) {
+ std::inplace_merge(SortedAndUnsortedItems.begin(), SortBoundary,
+ SortedAndUnsortedItems.end());
+ }
+ Budget = 0;
+ NumSorted = SortedAndUnsortedItems.size();
+ }
+ }
+ }
+
+ void VerifyInvariant() const {
+ assert(
+ !(NumSorted < SortedAndUnsortedItems.size() &&
+ SortedAndUnsortedItems.back() <
+ SortedAndUnsortedItems[NumSorted]) &&
+ "the boundary item must always be the minimum of the unsorted items");
+ }
+
+ llvm::SmallVector<DynTypedNode, 2> SortedAndUnsortedItems;
+ size_t NumSorted = 0;
+ int64_t Budget = 0;
};
/// Maps from a node to its parents. This is used for nodes that have
@@ -117,7 +174,7 @@ class ParentMapContext::ParentMap {
if (I == Map.end()) {
return llvm::ArrayRef<DynTypedNode>();
}
- if (const auto *V = dyn_cast<ParentVector *>(I->second)) {
+ if (auto *V = dyn_cast<ParentVector *>(I->second)) {
return V->view();
}
return getSingleDynTypedNodeFromParentMap(I->second);
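The budget-based lazy sorting in the diff above can be sketched with standard containers. This is a hypothetical simplification, not the patch itself: `int` stands in for `DynTypedNode`, `std::vector` for `llvm::SmallVector`, and the cost model is deliberately crude. The idea is one vector split into a sorted prefix and an unsorted suffix; membership queries do a binary search plus a linear scan, and the container only sorts once enough query work has accumulated to amortize the sort-and-merge cost.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

class LazySortedVector {
public:
  bool contains(int value) {
    auto boundary = items.begin() + numSorted;
    bool found = std::binary_search(items.begin(), boundary, value);
    budget += numSorted; // crude stand-in for the patch's bit_width() cost
    if (!found)
      found = std::find(boundary, items.end(), value) != items.end();
    budget += items.size() - numSorted; // cost of the linear scan
    sortIfWorthwhile();
    return found;
  }

  void push_back(int value) {
    ++budget;
    if (!contains(value)) // deduplicate without a separate "seen" set
      items.push_back(value);
  }

  const std::vector<int> &view() const { return items; }

private:
  void sortIfWorthwhile() {
    const size_t numUnsorted = items.size() - numSorted;
    if (numUnsorted == 0)
      return;
    // Estimated cost of sorting the suffix and merging it into the prefix.
    const size_t cost = numUnsorted * 8 + items.size();
    if (budget >= static_cast<int64_t>(cost)) {
      std::sort(items.begin() + numSorted, items.end());
      std::inplace_merge(items.begin(), items.begin() + numSorted,
                         items.end());
      numSorted = items.size();
      budget = 0;
    }
  }

  std::vector<int> items;
  size_t numSorted = 0;
  int64_t budget = 0;
};
```

One refinement from the real patch is omitted here: the patch also swaps the minimum of the unsorted suffix to the boundary position so it can cheaply detect whether a merge is needed. In this sketch the linear scan always covers the whole suffix, so correctness does not depend on that invariant.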
@AaronBallman should take a look at this as well. We might take a look at `SetVector` and see if there are any learnings we can take from that. It has a set and a vector together, but might have some interesting things.

Finally... I wonder if we were to just store this as a sorted vector of `DynTypedNode` + insert order, then do a copy & sort on access of `view`.

Can we do some benchmarking to figure out which things are done the 'most often' with what values of N? It might better guide this.
clang/lib/AST/ParentMapContext.cpp
Outdated
}
void push_back(const DynTypedNode &Value) {
if (!Value.getMemoizationData() || Seen.insert(Value).second)
if (!Value.getMemoizationData() || !contains(Value)) {
It seems unfortunate to potentially re-calculate in `contains` here, only to potentially immediately re-calculate it. I wonder if there is value in special-casing `ItemsProcessed == 0 && Items.capacity() == Items.size()` to just do a linear search?

Also, I wonder if it makes sense to only use the cache when `size()` is significant enough? It would be interesting to see if there is a value of N under which the set doesn't really make sense.
I could definitely optimize these further in the ways you mention, but those are just (very small) constant-factor savings in time, at a significant cost in readability and fragility. I didn't get the feeling it's worth it, to be honest -- do you? Note that the performance graphs almost coincide with Clang 17's right now.

Edit: Fixed below. The fix didn't affect performance.
Ah sorry, I forgot to reply to your review.

I did take a look at this actually, but it looked equivalent to what I was doing before, particularly with respect to storing the expensive keys (

I don't think that works. You don't want to do a ton of work on every

Furthermore, not only is it that elements without memoization data aren't comparable, but they still need to appear in the

The implementation in this PR addresses all of those constraints without any drawbacks, I think. There's less code compared to writing a facade, and performance-wise it feels just about optimal.

Re: benchmarking -- note that I did a general benchmark both with the test case that was provided in the associated issue, as well as with an expensive TU that we had internally. It had great performance in both cases. Regarding N -- I assume N is referring to the total container sizes (per the other comment)? The thing is, I don't get the impression there's much room for improvement here. The memory usage was just high by a constant (10x) factor. Some ~5x of that almost certainly came from the data type itself (storing
I see. Yes, so the problem is that I don't have a good idea here of the uses; I did a first run here. If

Ok, I think I'm OK with all that then. BUT as I said, I'm not the best person to be reviewing this; I've pinged Aaron and hope he'll come along.
clang/lib/AST/ParentMapContext.cpp
Outdated
const size_t OldCapacity = Items.capacity();
Items.push_back(Value);
if (OldCapacity != Items.capacity()) {
This is quite terrifying.
Would we impact performance significantly if we used a map of <pointer, index>, or a vector of pointers that has the same order as the vector of values?
That wouldn't compile without significant refactoring across several files. We need the `Items` vector to be convertible to `llvm::ArrayRef<DynTypedNode>`, because that's what `view()` returns.
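The constraint being described can be illustrated with a minimal stand-in for `llvm::ArrayRef` (the `Node` type and `view` helper below are hypothetical): a non-owning view over values requires those values to be stored contiguously, so a vector of pointers or a map cannot be handed out without copying.

```cpp
#include <cstddef>
#include <vector>

struct Node { int payload; }; // stand-in for clang::DynTypedNode

// Minimal stand-in for llvm::ArrayRef<Node>: a non-owning window that
// only works if the Node objects themselves are stored contiguously.
struct NodeView {
  const Node *data;
  size_t size;
};

// This is the shape of the constraint: view() returns a window directly
// over the stored values. A std::vector<Node*> or a map would have the
// wrong element layout, forcing a copy (or an API change) on every call.
NodeView view(const std::vector<Node> &items) {
  return {items.data(), items.size()};
}
```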
Just some minor nits from me so far.

I think there may be some additional performance benefits we could eke out, but I don't think they need to be done in this patch (or done at all, it's speculative). One is that I noticed `ASTContext::getTraversalScope()` is returning a vector by copy, and that doesn't seem necessary. Another is that perhaps we want to use a Bloom filter here instead, on the assumption that the parent map will always be fairly large. But again, this is all speculation.

Hmm... would SOME duplicates be acceptable? I THINK the original patch removed the duplicates, so they were presumably OK before then? IF that is acceptable (and uses are duplicate-tolerant), a Bloom filter that can reduce the number of duplicates to sub-1% would still be a massive win, right? We could perhaps do some tuning on the size of the filter to get that reasonable. According to the intro of the Wikipedia article on Bloom filters, 10 bits per element is all that is necessary for a 1% false-positive rate, but the details of the article and references get really "LaTeX-math-stuff" to me, so I don't have a good idea of what it would take to get the false-positive rate down low. BUT since the idea is just to reduce/eliminate duplicates without high memory overhead, it seems like it would be a great solution here.
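For illustration, here is a minimal Bloom filter of the kind being discussed. This is a hypothetical sketch, not LLVM code: the sizing constants, the double-hashing scheme, and the `BloomFilter` name are all assumptions. A Bloom filter never reports a false negative, so an inserted element is always found; the price is a tunable false-positive rate, which is why it could only reduce (not eliminate) duplicates.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

class BloomFilter {
  static constexpr size_t Bits = 1 << 16; // generously sized for ~1000 keys
  static constexpr int Probes = 7;
  std::bitset<Bits> bits;

  static size_t probe(uint64_t key, int i) {
    // Double hashing: h1 + i*h2 derives all probe positions from two hashes.
    uint64_t h1 = key * 0x9E3779B97F4A7C15ull;
    uint64_t h2 = ((key ^ (key >> 33)) * 0xC2B2AE3D27D4EB4Full) | 1;
    return (h1 + i * h2) % Bits;
  }

public:
  void insert(uint64_t key) {
    for (int i = 0; i < Probes; ++i)
      bits.set(probe(key, i));
  }

  // May return true for a key that was never inserted (false positive),
  // but never returns false for a key that was inserted.
  bool mayContain(uint64_t key) const {
    for (int i = 0; i < Probes; ++i)
      if (!bits.test(probe(key, i)))
        return false;
    return true;
  }
};
```

With far more than 10 bits per element, as sized here, the false-positive rate is well below the 1% figure quoted above; shrinking `Bits` toward 10 bits per inserted key trades memory for a higher duplicate rate.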
I noticed this too! I'm actually working on validating a RAC fix for this right now, just to have it return an ArrayRef instead. Both uses of that function just do observation, so it seems reasonable enough.
I think we are a bit confused, so can someone tell me if I'm wrong.

Here are some random ideas:
@cor3ntin We need to satisfy this constraint, which is not trivial:

llvm-project/clang/lib/AST/ParentMapContext.cpp Lines 79 to 81 in c54afe5

i.e. if we stored maps or pointers then we could no longer return an
Checking in -- has everybody had a chance to review this? Can we merge?
Was @cor3ntin correct here?

Depends which part you mean, but I don't believe so -- I tried to respond to it here. Ultimately, I was saying we can't have an intrusive linked list because of that constraint. The part about "just trying" to avoid duplicates is somewhat ambiguous; the code only seems to want to avoid duplicate nodes that carry memoization data, but not other nodes. It may just be an optimization, but I'm not confident enough to say whether duplicating those will cause a problem. The part about caring about parent order is true, as I've mentioned earlier, and the part about N being small I just don't know -- it probably depends on the codebase and how many parents/ancestors a class has.
LGTM, thanks for the fix!
I just realized there is another optimization we could do in

We could avoid the duplication check in

Thoughts on this? I'm inclined to give it a try to see if it's worth the code complexity.
It sounds worth a try, and perhaps could help readability. I think a 'remove dupes' step (aka std::unique/erase) is much more readable than what is being attempted here. As far as the
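The 'remove dupes' step suggested above can be sketched as follows (a hypothetical illustration, with `int` standing in for `DynTypedNode`): append parents freely, then deduplicate in one pass with sort + unique + erase. The caveat raised elsewhere in this thread still applies: the real container must preserve insertion order and handle nodes that are not meaningfully comparable, so this shape only works where a re-sorted result is acceptable.

```cpp
#include <algorithm>
#include <vector>

// Deduplicate by sorting (groups equal elements together), then erasing
// the tail that std::unique compacts away. Note the result is sorted,
// not in original insertion order.
std::vector<int> dedupe(std::vector<int> parents) {
  std::sort(parents.begin(), parents.end());
  parents.erase(std::unique(parents.begin(), parents.end()), parents.end());
  return parents;
}
```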
Okay thanks, I'll give it a shot. It might take me a bit to get to it. If in the meantime anyone needs this PR urgently, let me know and we can probably merge this and do that in a follow-up.
✅ With the latest revision this PR passed the C/C++ code formatter.
So, this is going to sound very silly, but it seems I may have missed the obvious, and the solution may have been extremely simple:

The reason is that the original code only ever inserts into the vector whenever

llvm-project/clang/lib/AST/ParentMapContext.cpp Lines 388 to 391 in a3c248d

Specifically,

This drastically simplifies everything -- please take another look and let me know if you see any issues!
This looks reasonable, I just have one question
LGTM with the vector changed back to a SmallVector. This is a much cleaner change, thank you for spotting it!
Interesting... this ends up just being a set-vector with a different identity being stored for the set. So this ends up being ~4 pointer-sizes (IIRC you said that `DynTypedNode` is 3x the size of a pointer) per record (though the extra 1 is only for those with memoization data).

Did you run the space benchmark here from the original bug? I definitely believe that this is an improvement (changing from 6x-ptr/element to < 4x-ptr/element), and perhaps the deduplication makes up for the extra data (since it only has to remove ONE duplicate to make up for 3 elements).

AS FAR AS the `SmallVector`/`SmallDenseSet` vs `std::vector`/`std::set` question, the decision here is interesting. `SmallVector` has a pretty massive stack size in exchange for saving on an allocation (it isn't SSO-like; it is more "size-of-vector + N, so that our allocation is on-stack unless size > N"). So it kind of depends on whether we think MOST uses of this are going to be '1' element, vs '0' or '1+' (that is, for the small vector of size 1 suggested by @cor3ntin). If it is almost always going to be 1, I agree with him. If it is almost always going to be 0 or more than 1, std::vector is probably correct (as in the 0 case, std::vector is smaller, and in the 1+ case, the allocation is happening anyway, and we just have dead stack space again).
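The footprint tradeoff being weighed here can be made concrete with a hypothetical `SmallVector`-like layout (the `InlineVector` struct below is an illustration, not LLVM's actual implementation): the object carries inline space for N elements, so it is larger on the stack than a plain `std::vector` but avoids a heap allocation while it holds at most N elements.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the llvm::SmallVector<T, N> layout: the
// object embeds storage for N elements, so "small" sizes never allocate,
// at the cost of a bigger object even when the container is empty.
template <typename T, unsigned N>
struct InlineVector {
  T inlineStorage[N]; // used until the vector outgrows N elements
  T *data;            // points at inlineStorage, then at heap memory
  size_t size;
  size_t capacity;
};
```

On a typical 64-bit target, `sizeof(std::vector<void*>)` is three pointers, while an inline-storage vector is already larger at small N -- which is exactly the "dead stack space" concern for containers that usually hold zero elements.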
Originally 5x (40 bytes vs. 8 bytes), not 3x. But it might be even higher now, since that was using

Yup. With the

Yeah, I don't know what to expect from user code here, but it seems the difference is small either way -- I'd say let's just go with
Just merged this, thanks everyone for the reviews! If I hadn't gotten the pushback earlier, I might never have spotted the simplification and optimization.
/cherry-pick 8c7f0ea

Error: Command failed due to missing milestone.

/cherry-pick 8c7f0ea

/pull-request #131209
…f nodes have been seen (llvm#129934) This mitigates a regression introduced in llvm#87824. The mitigation here is to store pointers to the deduplicated AST nodes, rather than copies of the nodes themselves. This allows a pointer-optimized set to be used and saves a lot of memory because `clang::DynTypedNode` is ~5 times larger than a pointer. Fixes llvm#129808. (cherry picked from commit 8c7f0ea)
This mitigates a regression introduced in #87824.
The mitigation here is two-fold:
- We only store pointers to the vector elements, rather than copies of the elements themselves. This saves a lot of memory because the elements are ~5 times larger than pointers.
- We populate the "seen" set lazily, whenever it is queried. This further reduces memory usage and speeds up queries.
Fixes #129808.
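The two-fold mitigation above can be sketched with standard containers. This is an illustrative simplification, not the merged code: `std::vector`/`std::unordered_set` stand in for `llvm::SmallVector`/`llvm::SmallDenseSet`, and the `Node` struct's `memoizationData` field stands in for `DynTypedNode::getMemoizationData()`. The set holds only pointer-sized keys, and it is caught up lazily: nodes without a key never trigger any set work at all.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

struct Node {
  const void *memoizationData; // null => node is never deduplicated
  int payload;
};

class ParentVector {
  std::vector<Node> items;
  std::unordered_set<const void *> seen; // pointer-sized keys, not Nodes
  size_t seenUpTo = 0; // items[0..seenUpTo) already folded into `seen`

  // Lazily fold any not-yet-processed items into the "seen" set.
  void catchUp() {
    for (; seenUpTo < items.size(); ++seenUpTo)
      if (const void *key = items[seenUpTo].memoizationData)
        seen.insert(key);
  }

public:
  void push_back(const Node &n) {
    if (const void *key = n.memoizationData) {
      catchUp(); // the set is only populated when a check is needed
      if (!seen.insert(key).second)
        return; // duplicate key: drop the node
    }
    items.push_back(n); // keyless nodes are always appended
  }

  const std::vector<Node> &view() const { return items; }
};
```

Storing `const void*` keys instead of whole nodes is what recovers the memory: per the discussion above, each `DynTypedNode` copy is roughly five times the size of a pointer.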