
Add convenience methods for extracting bits of text #824

Closed
@orlp

A lot of my use cases for regex involve extracting some information from text with a simple pattern, where full parsing would be overkill. In this scenario I generally just have N subsections of the text I'm looking to extract (possibly many times), where N is a small constant. Some examples from the recent Advent of Code:

let re = Regex::new(r"target area:\s*x=(-?\d+)..(-?\d+), y=(-?\d+)..(-?\d+)")?;
let re = Regex::new(r"Player 1 starting position:\s*(\d)\s*Player 2 starting position:\s*(\d)\s*")?;
let re = Regex::new(r"(on|off)\s+x=(-?\d+)..(-?\d+),y=(-?\d+)..(-?\d+),z=(-?\d+)..(-?\d+)")?;

In such a scenario it is quite the pain to extract the information I want using the current API. It generally involves something like this, for the last example:

use anyhow::Context;
use itertools::Itertools;
let (status, x1, x2, y1, y2, z1, z2) = re
    .captures(&line)
    .context("match failed")?
    .iter()
    .skip(1)
    .map(|c| c.unwrap().as_str())
    .collect_tuple()
    .unwrap();

Note that I already had to pull in a third-party library, itertools, just to get collect_tuple. Without it we would have to either call next() or get(i) repeatedly or make an unnecessary allocation, and in both cases an extra check is needed to ensure there weren't too many captures (which would indicate a programmer error).

I think that adding a convenience method for this incredibly common use-case would be nice. I would propose the following (with sample implementation):

/// Finds the leftmost-first match in `text` and returns a tuple containing the whole match
/// and its N participating capture groups as strings. If no match is found, `None` is returned.
///
/// # Panics
///
/// Panics if the number of participating captures is not equal to N.
fn extract<'t, const N: usize>(&self, text: &'t str) -> Option<(&'t str, [&'t str; N])> {
    let caps = self.captures(text)?;
    let mut participating = caps.iter().flatten();
    let whole_match = participating.next().unwrap().as_str();
    let captured = [0; N].map(|_| participating.next().unwrap().as_str());
    assert!(participating.next().is_none(), "too many participating capture groups");
    Some((whole_match, captured))
}

And similarly extract_iter returning impl Iterator<Item = (&'t str, [&'t str; N])>.
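
The core step of the sample implementation, filling a fixed-size array from an iterator and panicking on a count mismatch, does not depend on regex at all. Here is a minimal std-only sketch of just that step. `take_exact` and `split_date` are hypothetical names used for illustration, and `std::array::from_fn` is swapped in for the `[0; N].map(...)` trick used above.

```rust
/// Hypothetical helper (not part of the regex crate): collects exactly N
/// string slices from an iterator into a fixed-size array.
///
/// Panics if the iterator yields fewer or more than N items.
fn take_exact<'t, const N: usize>(mut items: impl Iterator<Item = &'t str>) -> [&'t str; N] {
    // Each array slot pulls the next item; running out early is a panic.
    let out = std::array::from_fn(|_| items.next().expect("too few items"));
    // Leftover items mean the caller asked for the wrong N.
    assert!(items.next().is_none(), "too many items");
    out
}

/// Hypothetical usage: N = 3 is inferred from the return type.
fn split_date(text: &str) -> [&str; 3] {
    take_exact(text.split('-'))
}
```

N can also be inferred from a destructuring pattern, as in `let [year, month, day] = take_exact(text.split('-'));`, which is the same inference the proposed `extract` relies on.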


Then, the cumbersome example above would become the elegant:

let [status, x1, x2, y1, y2, z1, z2] = re.extract(&line).context("match failed")?.1;

Another example from the README:

let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
for caps in re.captures_iter(TO_SEARCH) {
    // Note that all of the unwraps are actually OK for this regex
    // because the only way for the regex to match is if all of the
    // capture groups match. This is not true in general though!
    println!("year: {}, month: {}, day: {}",
                caps.get(1).unwrap().as_str(),
                caps.get(2).unwrap().as_str(),
                caps.get(3).unwrap().as_str());
}

This could become:

let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
for (_date, [year, month, day]) in re.extract_iter(TO_SEARCH) {
    println!("year: {}, month: {}, day: {}", year, month, day);
}

Caveats

This API does panic, but I believe that this is fine. It only ever panics on a direct programmer error (the number of participating capture groups not matching the number of outputs), which would always have been wrong, and never on an exceptional but otherwise valid scenario. Anyone for whom this is not acceptable can still use the original API.

Additionally, we call .flatten() on the capture groups to hide any non-participating ones. I believe this is fine as well, since it allows use cases such as the following, where an alternation has a similar capture group on either side and we don't want to capture the surrounding context:

let id_strings_re = Regex::new(r#""id:([^"]*)"|'id:([^']*)'"#)?;
let [id] = id_strings_re.extract(text).unwrap().1;
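
To make the effect of the flattening concrete, here is a small std-only sketch that models the capture slots of a single match as `Option<&str>` values, with `None` standing for a group that did not participate. The `participating` function and the sample slot values are illustrative stand-ins, not types or output from the regex crate.

```rust
/// Illustrative stand-in: keeps only the slots that participated in the
/// match, in order, dropping every `None`.
fn participating<'t>(slots: &[Option<&'t str>]) -> Vec<&'t str> {
    slots.iter().copied().flatten().collect()
}

/// Slot 0 is the whole match; slots 1 and 2 model the two alternation
/// branches, of which only one participates in any given match.
fn id_examples() -> (Vec<&'static str>, Vec<&'static str>) {
    let double_quoted = [Some("\"id:abc\""), Some("abc"), None];
    let single_quoted = [Some("'id:xyz'"), None, Some("xyz")];
    (participating(&double_quoted), participating(&single_quoted))
}
```

Either branch leaves exactly one participating group after the whole match, which is why `let [id] = ...` works for both. A pattern where the number of participating groups varies from match to match (an optional group, say) would trip the panic described in the caveat above.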

If this would be a welcome addition I'm more than happy to make a pull request.
