Skip to content

tuned boyer moore fails a match case #446

Closed
@BurntSushi

Description

@BurntSushi

Here is a failing test case:

    #[test]
    fn bm_backstop_boundary() {
        let haystack = b"\
// aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
e_data.clone_created(entity_id, entity_to_add.entity_id);
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
".to_vec();
        let needle = b"clone_created".to_vec();

        let searcher = BoyerMooreSearch::new(needle);
        let result = searcher.find(&haystack);
        assert_eq!(Some(43), result);
    }

This was originally reported against ripgrep:

Both bugs are related to TBM. That is, shutting TBM off fixes the issue. The above unit test was derived by widdling down the example corpus given in BurntSushi/ripgrep#781

I've tried to fix this myself, but the code is dense. As best I can tell, the md2 shift is wrong, or this is hitting up into a boundary condition with respect to the backstop (more likely?). That is, this particular case enters the memchr skip routine, which uses _ as its guard char. This takes window_end from 12 to 44, which is positioned at the l in clone_created. This then fails check_match (which seems correct), but this in turn causes the md2_shift (which is 12) to get added to the window_end index, which causes the code to skip ahead too far and miss the match.

The part that doesn't make sense to me here is the relationship between check_match and the md2_shift. Namely, check_match appears to be used as a guard against whether md2_shift is used. If the match fails, then the code assumes we can skip md2_shift bytes. It almost seems like the case analysis is a little off. For example, this code seems to fix this particular bug:

                if self.check_match(haystack, window_end) {
                    return Some(window_end - (self.pattern.len() - 1));
                } else if window_end + self.md2_shift >= backstop {
                    break;
                } else {
                    window_end += self.md2_shift;
                    continue;
                }

but I didn't arrive to this code in a principled way, so I'm not confident in it. It feels like to me the very fact that the md2_shift could overshoot the match is the problem, but it's less clear how to fix that.

@ethanpailes Do you have any ideas here? I am probably going to disable TBM for now because of this bug. Would it make sense to use a more traditional TBM variant where we don't use frequency analysis but instead always skip based on the last byte?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions