-
Notifications
You must be signed in to change notification settings - Fork 12k
common
: add partial regex support
#12808
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't yet have time to look into details, but the code gives a good spirit 👍
Seems like this can be an opportunity to development some kind of fuzzing tool
auto it = pattern.begin(); | ||
const auto end = pattern.end(); | ||
|
||
std::function<std::string()> process = [&]() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw I think we should prevent using lambda function when possible (especially long functions), the bad thing is that it can increase compilation time by a lot.
A cleaner solution could be do have a struct reversed_partial_regex_builder
for it, move lambda into struct member, and then having this regex_to_reversed_partial_regex
construct a new instance of the struct each time you use it
common/common.h
Outdated
} | ||
// While we wait for C++20's std::string::ends_with... | ||
bool string_ends_with(const std::string & str, const std::string & suffix); | ||
size_t string_find_partial_stop(const std::string &str, const std::string &stop); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
size_t string_find_partial_stop(const std::string &str, const std::string &stop); | |
size_t string_find_partial_stop(const std::string & str, const std::string & stop); |
common/regex-partial.cpp
Outdated
while (it != end) { | ||
if (*it == '\\' && (++it != end)) { | ||
++it; | ||
} else if (*it == ']') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If ++it == end
in the previous branch, here we will dereference the end()
iterator which is not allowed.
common/regex-partial.cpp
Outdated
} else if (*it == '\\' && (++it != end)) { | ||
auto str = std::string("\\") + *it; | ||
sequence->push_back(str); | ||
++it; | ||
} else { | ||
sequence->push_back(std::string(1, *it)); | ||
++it; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar problem with potentially dereferencing end()
common/common.cpp
Outdated
if (stop[char_index] == text_last_char) { | ||
const std::string current_partial = stop.substr(0, char_index + 1); | ||
if (string_ends_with(str, current_partial)) { | ||
return str.size() - char_index - 1; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these sub-string copies can be avoided via a std::string_view
overload of string_ends_with
.
common/regex-partial.h
Outdated
|
||
#include <regex> | ||
#include <string> | ||
#include "ggml.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of including ggml.h
, replace the GGML_ASSERT
with throw
.
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
This is a crude replacement for boost::match_partial, allowing to do partial find / match of regexes (if partial match is at the end of the tested string).
This was extracted from #12379 (streaming tool call support), in which it's used for partial parsers.
Rather than reimplementing an entire regex engine, I decided to take the lazy route of turning any pattern to a reverse pattern that matches from the string end, using the normal regex logic for the actual matching. The new
common_regex
may return full or partial matches. Full matches have their normal captures, while partial matches only have one capture group (the partial match).For instance, given the pattern
/abc/
, the reverse pattern is/((?:(?:c)?b)?a)[\s\S]*/
so it can match"fooo ab"
(and capture"ab"
; empty capture means no match at all).