[Bug regex/24316] New: regexec() violates POSIX longest match rule

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

[Bug regex/24316] New: regexec() violates POSIX longest match rule

glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=24316

            Bug ID: 24316
           Summary: regexec() violates POSIX longest match rule
           Product: glibc
           Version: 2.29
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: regex
          Assignee: unassigned at sourceware dot org
          Reporter: skvadrik at gmail dot com
                CC: drepper.fsp at gmail dot com
  Target Milestone: ---

Created attachment 11665
  --> https://sourceware.org/bugzilla/attachment.cgi?id=11665&action=edit
regexec_bug.c

Hi!

Not sure if this is a long-known issue, so reporting just in case.

Glibc implementation of regexec() violates  POSIX longest match rule:
http://pubs.opengroup.org/onlinepubs/7908799/xbd/re.html

    Consistent with the whole match being the longest
    of the leftmost matches, each subpattern, from left
    to right, matches the longest possible string.

For example, consider RE (a|ab)(bc|c) and string abc: first submatch group
should match (ab), and second group should match (c). However, glibc match is
different, as demonstrated by the following program (regexec_bug.c):

#include <sys/types.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

static void test(const char *pattern, const char *string)
{
    regex_t regex;
    int nmatch = 0, e = 0;
    regmatch_t *pmatch = 0;

    e = regcomp(&regex, pattern, REG_EXTENDED);
    if (e != 0) {
        fprintf(stderr, "regcomp failed for RE %s\n", pattern);
        goto end;
    }

    nmatch = regex.re_nsub + 1;
    pmatch = malloc(sizeof(regmatch_t) * nmatch);

    e = regexec(&regex, string, nmatch, pmatch, 0);
    if (e != 0) {
        fprintf(stderr, "regexec failed for RE %s\n", pattern);
        goto end;
    }

    fprintf(stderr, "\nRE %s:\n", pattern);
    for (int i = 0; i < nmatch; ++i) {
        const regoff_t x = pmatch[i].rm_so, y = pmatch[i].rm_eo;
        if (x != -1) {
            fprintf(stderr, "%d: %.*s\n", i, y - x, string + x);
        }
    }

end:
    free(pmatch);
    regfree(&regex);
}

int main(void)
{
    test("^(a|ab)(bc|c)$", "abc");
    test("^(ab|a)(bc|c)$", "abc");
    test("^(a|aa)*$", "aa");
    test("^(aa|a)*$", "aa");
    return 0;
}

$ gcc regexec_bug.c -Wall -O2 -g -oregexec_bug

Actual result:

$ ./regexec_bug

RE ^(a|ab)(bc|c)$:
0: abc
1: a
2: bc

RE ^(ab|a)(bc|c)$:
0: abc
1: ab
2: c

RE ^(a|aa)*$:
0: aa
1: a

RE ^(aa|a)*$:
0: aa
1: aa

Expected result:

$ ./regexec_bug

RE ^(a|ab)(bc|c)$:
0: abc
1: ab
2: c

RE ^(ab|a)(bc|c)$:
0: abc
1: ab
2: c

RE ^(a|aa)*$:
0: aa
1: aa

RE ^(aa|a)*$:
0: aa
1: aa

Exact glibc version:
glibc-2.29.9000-118-g86bdd49d93

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug regex/24316] regexec() violates POSIX longest match rule

glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=24316

skvadrik at gmail dot com <skvadrik at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |skvadrik at gmail dot com

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug regex/24316] regexec() violates POSIX longest match rule

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=24316

--- Comment #1 from skvadrik at gmail dot com <skvadrik at gmail dot com> ---
Also, is there any formal description of regcomp()/regexec() implementation,
other than the source code itself? I can try to reverse-engineer it from the
source code, but it would be helpful to understand the original author's
design.

Commit 3b0bdc723579a7c6df2cace0115a6ca0977d73f9 points to Isamu Hasegawa
<[hidden email]>, but that email address no longer exists.

--
You are receiving this mail because:
You are on the CC list for the bug.