tuned boyer moore fails a match case

Here is a failing test case:

```rust
    #[test]
    fn bm_backstop_boundary() {
        let haystack = b"\
// aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
e_data.clone_created(entity_id, entity_to_add.entity_id);
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
".to_vec();
        let needle = b"clone_created".to_vec();

        let searcher = BoyerMooreSearch::new(needle);
        let result = searcher.find(&haystack);
        assert_eq!(Some(43), result);
    }
```

This was originally reported against ripgrep:

* https://github.com/BurntSushi/ripgrep/issues/780
* https://github.com/BurntSushi/ripgrep/issues/781

Both bugs are related to TBM. That is, shutting TBM off fixes the issue. The above unit test was derived by widdling down the example corpus given in https://github.com/BurntSushi/ripgrep/issues/781

I've tried to fix this myself, but the code is dense. As best I can tell, the md2 shift is wrong, or this is hitting up into a boundary condition with respect to the backstop (more likely?). That is, this particular case enters the `memchr` skip routine, which uses `_` as its guard char. This takes `window_end` from `12` to `44`, which is positioned at the `l` in `clone_created`. This then fails `check_match` (which seems correct), but this in turn causes the `md2_shift` (which is `12`) to get added to the `window_end` index, which causes the code to skip ahead too far and miss the match.

The part that doesn't make sense to me here is the relationship between `check_match` and the `md2_shift`. Namely, `check_match` appears to be used as a guard against whether `md2_shift` is used. If the match fails, then the code assumes we can skip `md2_shift` bytes. It almost seems like the case analysis is a little off. For example, this code seems to fix this particular bug:

```rust
                if self.check_match(haystack, window_end) {
                    return Some(window_end - (self.pattern.len() - 1));
                } else if window_end + self.md2_shift >= backstop {
                    break;
                } else {
                    window_end += self.md2_shift;
                    continue;
                }
```

but I didn't arrive to this code in a principled way, so I'm not confident in it. It feels like to me the very fact that the `md2_shift` could overshoot the match is the problem, but it's less clear how to fix that.

@ethanpailes Do you have any ideas here? I am probably going to disable TBM for now  because of this bug. Would it make sense to use a more traditional TBM variant where we don't use frequency analysis but instead always skip based on the last byte?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tuned boyer moore fails a match case #446

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

tuned boyer moore fails a match case #446

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions