Skip to content

common: add partial regex support #12808

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
May 14, 2025
Merged

Conversation

ochafik
Copy link
Collaborator

@ochafik ochafik commented Apr 7, 2025

This is a crude replacement for boost::match_partial, allowing to do partial find / match of regexes (if partial match is at the end of the tested string).

This was extracted from #12379 (streaming tool call support), in which it's used for partial parsers.

Rather than reimplementing an entire regex engine, I decided to take the lazy route of turning any pattern to a reverse pattern that matches from the string end, using the normal regex logic for the actual matching. The new common_regex may return full or partial matches. Full matches have their normal captures, while partial matches only have one capture group (the partial match).

For instance, given the pattern /abc/, the reverse pattern is /((?:(?:c)?b)?a)[\s\S]*/ so it can match "fooo ab" (and capture "ab"; empty capture means no match at all).

@github-actions github-actions bot added testing Everything test related examples server labels Apr 7, 2025
@ochafik ochafik marked this pull request as ready for review April 11, 2025 05:20
@ochafik ochafik requested a review from ngxson as a code owner April 11, 2025 05:20
Copy link
Collaborator

@ngxson ngxson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't yet have time to look into details, but the code gives a good spirit 👍

Seems like this can be an opportunity to development some kind of fuzzing tool

auto it = pattern.begin();
const auto end = pattern.end();

std::function<std::string()> process = [&]() {
Copy link
Collaborator

@ngxson ngxson Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw I think we should prevent using lambda function when possible (especially long functions), the bad thing is that it can increase compilation time by a lot.

A cleaner solution could be do have a struct reversed_partial_regex_builder for it, move lambda into struct member, and then having this regex_to_reversed_partial_regex construct a new instance of the struct each time you use it

common/common.h Outdated
}
// While we wait for C++20's std::string::ends_with...
bool string_ends_with(const std::string & str, const std::string & suffix);
size_t string_find_partial_stop(const std::string &str, const std::string &stop);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
size_t string_find_partial_stop(const std::string &str, const std::string &stop);
size_t string_find_partial_stop(const std::string & str, const std::string & stop);

while (it != end) {
if (*it == '\\' && (++it != end)) {
++it;
} else if (*it == ']') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ++it == end in the previous branch, here we will dereference the end() iterator which is not allowed.

Comment on lines 169 to 177
} else if (*it == '\\' && (++it != end)) {
auto str = std::string("\\") + *it;
sequence->push_back(str);
++it;
} else {
sequence->push_back(std::string(1, *it));
++it;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar problem with potentially dereferencing end()

Comment on lines 454 to 459
if (stop[char_index] == text_last_char) {
const std::string current_partial = stop.substr(0, char_index + 1);
if (string_ends_with(str, current_partial)) {
return str.size() - char_index - 1;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these sub-string copies can be avoided via a std::string_view overload of string_ends_with.


#include <regex>
#include <string>
#include "ggml.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of including ggml.h, replace the GGML_ASSERT with throw.

@ochafik ochafik merged commit 3198405 into ggml-org:master May 14, 2025
44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants