CTFE interning: don't walk allocations that don't need it #97585


Merged Jul 2, 2022 · 6 commits
47 changes: 45 additions & 2 deletions compiler/rustc_const_eval/src/interpret/intern.rs
@@ -168,8 +168,51 @@ impl<'rt, 'mir, 'tcx: 'mir, M: CompileTimeMachine<'mir, 'tcx, const_eval::Memory
         mplace: &MPlaceTy<'tcx>,
         fields: impl Iterator<Item = InterpResult<'tcx, Self::V>>,
     ) -> InterpResult<'tcx> {
-        // ZSTs cannot contain pointers, so we can skip them.
-        if mplace.layout.is_zst() {
+        // We want to walk the aggregate to look for references to intern. While doing that we
+        // also need to take special care of interior mutability.
+        //
+        // As an optimization, however, if the allocation does not contain any references: we don't
+        // need to do the walk. It can be costly for big arrays for example (e.g. issue #93215).
+        let is_walk_needed = |mplace: &MPlaceTy<'tcx>| -> InterpResult<'tcx, bool> {
+            // ZSTs cannot contain pointers, we can avoid the interning walk.
+            if mplace.layout.is_zst() {
+                return Ok(false);
+            }
+
+            // Now, check whether this allocation could contain references.
+            //
+            // Note, this check may sometimes not be cheap, so we only do it when the walk we'd like
+            // to avoid could be expensive: on the potentially larger types, arrays and slices,
+            // rather than on all aggregates unconditionally.
+            if matches!(mplace.layout.ty.kind(), ty::Array(..) | ty::Slice(..)) {
+                let Some((size, align)) = self.ecx.size_and_align_of_mplace(&mplace)? else {
Contributor:
The only possible source of the regression that I can see is this function call invoking layout.field() to get the slice element's layout, which should also happen in the regular walk. But we only do that field layout computation if we're going to take the fast path here, so there should be no duplicate work.

Unsure how to debug this further

@lqd (Member, Author), Jun 28, 2022:
The valgrind developers have finally fixed the DWARF bug, and I was able to run the CI artifacts to reproduce the issue that oli posted.

> The only possible source of the regression that I can see

I wonder if that's the case. diesel is unusual here in that 3% of the check time is spent in try_fold_with and the like; does the current PR change anything related to folding, @oli-obk? (I would expect it not to.)

Edit: After spending the better part of the day looking at this, I've noticed that I gathered my statistics (tracing rustc) over a clean diesel build (with dependencies and so on included), while rustc-perf just builds the leaf crate. Unless I'm mistaken again, diesel only has one aggregate that is walked during the check-build benchmark, and it's a ZST hitting the early return, making this regression either more puzzling, or easier to discard as noise.

The number of TyAndLayout::field() calls has apparently decreased (edit: on a clean build): 329512 on master and 326641 here. Were those the calls you were wondering about? (In the check benchmark I'm not seeing any change here from this PR.)

I also wonder why this didn't show up in the previous run; if anything, size_and_align_of_mplace is called at most as often as before. Just in case, I'll revert to the same commit and run the bot again, to see whether it's indeed noise.


I'm seeing some tiny cachegrind differences locally as well, but they are completely unrelated (and it seems unlikely to be about PGO). They could be about allocation: cachegrind attributes more time to ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms. The problem is that it does this even for two different rustc worktrees at the same commit with the same config.toml, making things tough to analyze (I'll ignore this memcpy difference).

--------------------------------------------------------------------------------
Ir
--------------------------------------------------------------------------------
2,497,210  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir         file:function
--------------------------------------------------------------------------------
2,442,872  ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms
  -81,685  ./malloc/arena.c:_int_free
   68,707  ./string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_erms

The visit_aggregate function in particular is pretty short, and some variance in the machine makes the 4945 calls' wall-time noisy (and doesn't register in cachegrind at all, neither does const interning apparently) like so:

  • master: 19ms
  • this PR: 16ms

Edit: the numbers above are actually the clean-build stats, not a leaf-crate build like rustc-perf: IIUC they mostly come from dependencies (std/alloc, via proc-macros). A single call plus the early return couldn't make such a big regression.

I was about to post the actual flow of the 4945 aggregates to list the ZSTs, the early returns, unsized mplaces, get_ptr_alloc returning None and so on, but I don't think it's worth the time for what I believe is likely noise 😓 .

edit one million: and now after a rebase over master, the allocator noise I was seeing has switched to the other direction. Incredible.

--------------------------------------------------------------------------------
Ir
--------------------------------------------------------------------------------
 -813,390  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir         file:function
--------------------------------------------------------------------------------
 -717,329  ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms

+                    // We do the walk if we can't determine the size of the mplace: we may be
+                    // dealing with extern types here in the future.
+                    return Ok(true);
+                };
+
+                // If there are no relocations in this allocation, it does not contain references
+                // that point to another allocation, and we can avoid the interning walk.
+                if let Some(alloc) = self.ecx.get_ptr_alloc(mplace.ptr, size, align)? {
+                    if !alloc.has_relocations() {
+                        return Ok(false);
+                    }
+                } else {
+                    // We're encountering a ZST here, and can avoid the walk as well.
+                    return Ok(false);
+                }
+            }
+
+            // In the general case, we do the walk.
+            Ok(true)
+        };
+
+        // If this allocation contains no references to intern, we avoid the potentially costly
+        // walk.
+        //
+        // We can do this before the checks for interior mutability below, because only references
+        // are relevant in that situation, and we're checking if there are any here.
+        if !is_walk_needed(mplace)? {
             return Ok(());
         }

Expand Down
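The shape of this change can be sketched in isolation: before recursively visiting every element of an aggregate, run a cheap check that can prove the walk unnecessary. This is an illustrative stand-in, not rustc code; `Alloc`, `is_walk_needed`, and `intern` below are hypothetical simplifications of the real interner.

```rust
// Simplified model: an "allocation" with a length and a precomputed
// "does it contain relocations (i.e. references)?" flag.
#[derive(Clone, Copy)]
struct Alloc {
    len: usize,
    has_relocations: bool,
}

fn is_walk_needed(alloc: &Alloc) -> bool {
    // ZST-like allocation: nothing to visit.
    if alloc.len == 0 {
        return false;
    }
    // No relocations means no references pointing to other allocations,
    // so interning has nothing to find in the elements.
    if !alloc.has_relocations {
        return false;
    }
    // General case: do the walk.
    true
}

fn intern(alloc: &Alloc, walked: &mut usize) {
    if !is_walk_needed(alloc) {
        return; // skip the potentially costly per-element walk
    }
    *walked += alloc.len; // stand-in for visiting each element
}

fn main() {
    let mut walked = 0;
    // A big array of plain bytes: the walk is skipped entirely.
    intern(&Alloc { len: 1_000_000, has_relocations: false }, &mut walked);
    assert_eq!(walked, 0);
    // An aggregate that does hold references: walked as before.
    intern(&Alloc { len: 4, has_relocations: true }, &mut walked);
    assert_eq!(walked, 4);
}
```

The precondition is only worthwhile because it is much cheaper than the walk it avoids; in the PR, the analogous cheap facts are `is_zst()` and `has_relocations()`.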
5 changes: 5 additions & 0 deletions compiler/rustc_const_eval/src/interpret/memory.rs
@@ -942,6 +942,11 @@ impl<'tcx, 'a, Tag: Provenance, Extra> AllocRef<'a, 'tcx, Tag, Extra> {
             .check_bytes(&self.tcx, self.range.subrange(range), allow_uninit, allow_ptr)
             .map_err(|e| e.to_interp_error(self.alloc_id))?)
     }
+
+    /// Returns whether the allocation has relocations for the entire range of the `AllocRef`.
+    pub(crate) fn has_relocations(&self) -> bool {
+        self.alloc.has_relocations(&self.tcx, self.range)
+    }
 }
 
 impl<'mir, 'tcx: 'mir, M: Machine<'mir, 'tcx>> InterpCx<'mir, 'tcx, M> {
Expand Down
17 changes: 11 additions & 6 deletions compiler/rustc_middle/src/mir/interpret/allocation.rs
@@ -537,21 +537,26 @@ impl<Tag: Provenance, Extra> Allocation<Tag, Extra> {
 /// Relocations.
 impl<Tag: Copy, Extra> Allocation<Tag, Extra> {
     /// Returns all relocations overlapping with the given pointer-offset pair.
-    pub fn get_relocations(&self, cx: &impl HasDataLayout, range: AllocRange) -> &[(Size, Tag)] {
+    fn get_relocations(&self, cx: &impl HasDataLayout, range: AllocRange) -> &[(Size, Tag)] {
         // We have to go back `pointer_size - 1` bytes, as that one would still overlap with
         // the beginning of this range.
         let start = range.start.bytes().saturating_sub(cx.data_layout().pointer_size.bytes() - 1);
         self.relocations.range(Size::from_bytes(start)..range.end())
     }
 
+    /// Returns whether this allocation has relocations overlapping with the given range.
+    ///
+    /// Note: this function exists to allow `get_relocations` to be private, in order to somewhat
+    /// limit access to relocations outside of the `Allocation` abstraction.
+    pub fn has_relocations(&self, cx: &impl HasDataLayout, range: AllocRange) -> bool {
+        !self.get_relocations(cx, range).is_empty()
+    }
+
     /// Checks that there are no relocations overlapping with the given range.
     #[inline(always)]
     fn check_relocations(&self, cx: &impl HasDataLayout, range: AllocRange) -> AllocResult {
-        if self.get_relocations(cx, range).is_empty() {
-            Ok(())
-        } else {
-            Err(AllocError::ReadPointerAsBytes)
-        }
+        if self.has_relocations(cx, range) { Err(AllocError::ReadPointerAsBytes) } else { Ok(()) }
     }
 
     /// Removes all relocations inside the given range.
Expand Down
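The `saturating_sub(pointer_size - 1)` in `get_relocations` widens the query backwards because relocations are keyed by the offset where a pointer starts, and a pointer starting up to `pointer_size - 1` bytes before the queried range still overlaps it. A minimal sketch of that lookup, under assumptions: a hypothetical `relocations_overlapping` helper over a plain `BTreeMap`, with an 8-byte pointer size.

```rust
use std::collections::BTreeMap;

const POINTER_SIZE: u64 = 8; // assumed 64-bit target

/// Hypothetical miniature of the lookup: relocations are keyed by the byte
/// offset where a pointer starts, but each pointer occupies `POINTER_SIZE`
/// bytes, so one starting before `start` can still overlap `start..end`.
fn relocations_overlapping(
    relocations: &BTreeMap<u64, &'static str>,
    start: u64,
    end: u64,
) -> Vec<u64> {
    // Back off by `POINTER_SIZE - 1` bytes, saturating at offset 0,
    // mirroring the `saturating_sub` in `get_relocations`.
    let adjusted_start = start.saturating_sub(POINTER_SIZE - 1);
    relocations.range(adjusted_start..end).map(|(&offset, _)| offset).collect()
}

fn main() {
    let mut relocs = BTreeMap::new();
    relocs.insert(0, "ptr_a"); // pointer occupying bytes 0..8
    relocs.insert(16, "ptr_b"); // pointer occupying bytes 16..24

    // Bytes 4..6: ptr_a starts at offset 0 but still covers byte 4.
    assert_eq!(relocations_overlapping(&relocs, 4, 6), vec![0]);
    // Bytes 8..16 fall between the two pointers: no overlap.
    assert!(relocations_overlapping(&relocs, 8, 16).is_empty());
}
```

Without the back-off, the first query would start the `range` scan at offset 4 and miss `ptr_a` entirely, which is exactly the bug the `pointer_size - 1` adjustment prevents.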