Description
Here is a failing test case:
#[test]
fn bm_backstop_boundary() {
let haystack = b"\
// aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
e_data.clone_created(entity_id, entity_to_add.entity_id);
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
".to_vec();
let needle = b"clone_created".to_vec();
let searcher = BoyerMooreSearch::new(needle);
let result = searcher.find(&haystack);
assert_eq!(Some(43), result);
}
This was originally reported against ripgrep:
- ripgrep misses some matching lines BurntSushi/ripgrep#780
- ripgrep misses a match but succeeds with
-S
and-i
BurntSushi/ripgrep#781
Both bugs are related to TBM. That is, shutting TBM off fixes the issue. The above unit test was derived by widdling down the example corpus given in BurntSushi/ripgrep#781
I've tried to fix this myself, but the code is dense. As best I can tell, the md2 shift is wrong, or this is hitting up into a boundary condition with respect to the backstop (more likely?). That is, this particular case enters the memchr
skip routine, which uses _
as its guard char. This takes window_end
from 12
to 44
, which is positioned at the l
in clone_created
. This then fails check_match
(which seems correct), but this in turn causes the md2_shift
(which is 12
) to get added to the window_end
index, which causes the code to skip ahead too far and miss the match.
The part that doesn't make sense to me here is the relationship between check_match
and the md2_shift
. Namely, check_match
appears to be used as a guard against whether md2_shift
is used. If the match fails, then the code assumes we can skip md2_shift
bytes. It almost seems like the case analysis is a little off. For example, this code seems to fix this particular bug:
if self.check_match(haystack, window_end) {
return Some(window_end - (self.pattern.len() - 1));
} else if window_end + self.md2_shift >= backstop {
break;
} else {
window_end += self.md2_shift;
continue;
}
but I didn't arrive to this code in a principled way, so I'm not confident in it. It feels like to me the very fact that the md2_shift
could overshoot the match is the problem, but it's less clear how to fix that.
@ethanpailes Do you have any ideas here? I am probably going to disable TBM for now because of this bug. Would it make sense to use a more traditional TBM variant where we don't use frequency analysis but instead always skip based on the last byte?