When a specially crafted string composed of single-byte and multi-byte characters is passed to
re_search(), the function seems to lose track of which characters are
multi-byte and returns an incorrect match. This seems to be exclusive to the
The problem can be reproduced when the following string:
... is matched against the pattern:
The two bytes in the pattern are respectively "the last byte of the second
multi-byte char" and "the first byte of the third multi-byte char" in the
The number of "a"s prefixed in the original string seems to make all the
difference here. I could only reproduce the problem when exactly 3 or 4 "a"s
are prefixed. I.e., if you remove one "a" from the prefix of the original
... the problem no longer happens.
I'm attaching a script that reproduces the problem. The 'sed' version I'm using
is compiled with "--without-included-regex", so it should use glibc's regex
functions. Unfortunately, I can't yet confirm that the bug is not in sed, but I'm
trying to create a self-contained program to demonstrate the problem.
Proposed fix. There is another bug in sed that triggers an infinite loop.
re_search_internal(), in case 6 of its switch (match_kind), finds a possible
match. In the case of our false match, verification of the match, which does not
respect multi-byte characters, fails and match_regex() returns the index of such false
Going deeper, re_search_internal() calls re_string_reconstruct() and that calls
re_string_skip_chars() is an I18N-specific function that advances character by
character up to the indexed character. It operates on whole multi-byte
characters, not on bytes.
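To make the idea of skipping by characters concrete, here is a minimal sketch built on the standard mbrtowc(). It is not glibc's re_string_skip_chars(): the name skip_chars_sketch and its handling of invalid bytes are assumptions made for illustration only. The caller is expected to have selected a multi-byte locale with setlocale() beforehand.

```c
#include <assert.h>
#include <locale.h>   /* setlocale(), to be called by the caller */
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, NOT glibc's re_string_skip_chars(): advance over up to
   nchars characters of buf (len bytes long) in the current LC_CTYPE locale
   and return the byte offset that was reached. */
size_t skip_chars_sketch(const char *buf, size_t len, size_t nchars)
{
    mbstate_t state;
    size_t off = 0;

    assert(buf != NULL);
    memset(&state, 0, sizeof state);        /* initial conversion state */

    while (nchars > 0 && off < len) {
        /* The third argument is the number of bytes still available. */
        size_t n = mbrtowc(NULL, buf + off, len - off, &state);

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid or incomplete sequence: count the byte as one
               character and restart from a clean state (an assumption). */
            n = 1;
            memset(&state, 0, sizeof state);
        } else if (n == 0) {
            n = 1;                          /* embedded NUL: one byte */
        }
        off += n;
        nchars--;
    }
    return off;
}
```

The point to notice is the third argument of mbrtowc(): if the remaining length passed there is smaller than what is really left in the buffer, a complete character can be reported as incomplete.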
In a correct run, it returns the correct index of the next character to
inspect. When the bug occurs, __mbrtowc called from there returns -2
(incomplete multi-byte character). Why? It seems to be caused by remain_len
being equal to 1, even though there are still 6 bytes to inspect.
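The -2 result can be illustrated in isolation, outside the regex code. The sketch below calls the standard mbrtowc() while claiming that only remain_len bytes are available. The helper name decode_with_limit is hypothetical, and the usage note assumes a UTF-8 locale as a stand-in, since an EUC-JP locale may not be installed; the behaviour for a truncated character is the same in both.

```c
#include <assert.h>
#include <locale.h>   /* setlocale(), to be called by the caller */
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: try to decode the first character of buf while
   claiming that only remain_len bytes are available. Returns the raw
   mbrtowc() result, so (size_t)-2 means "incomplete character". */
size_t decode_with_limit(const char *buf, size_t remain_len)
{
    mbstate_t state;
    wchar_t wc;

    assert(buf != NULL);
    memset(&state, 0, sizeof state);        /* initial conversion state */
    return mbrtowc(&wc, buf, remain_len, &state);
}
```

In a UTF-8 locale, for the two-byte sequence "\xC3\xA9", decode_with_limit(s, 1) returns (size_t)-2 and decode_with_limit(s, 2) returns 2. The bytes are all present in the buffer; only the length passed in decides whether the character is seen as complete.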
I believe that remain_len is computed incorrectly:
If my observation is correct, the bug is not EUC-JP specific.
- The charset must allow a false match across the boundary of two
characters. EUC-JP fits this requirement, UTF-8 probably does not.
- There is a true ASCII match that is a false match in the locale-specific charset.
- This false match must appear at an exact place, about two thirds of the way through the string.
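The first two conditions can be sketched without any locale support. The code below is only an illustration: the function names are hypothetical, the byte values in the usage note are made up (they are not the strings from this report), and the hand-written EUC-JP length rule is a simplification that does not validate trail bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length in bytes of the EUC-JP character starting at p, clamped to remain. */
size_t eucjp_char_len(const unsigned char *p, size_t remain)
{
    size_t len = 1;                         /* ASCII or anything unexpected */

    if (*p == 0x8F)
        len = 3;                            /* SS3 + two bytes (JIS X 0212) */
    else if (*p == 0x8E || (*p >= 0xA1 && *p <= 0xFE))
        len = 2;                            /* SS2 + one byte, or JIS X 0208 */
    return len <= remain ? len : remain;
}

/* Byte-wise search: tries every byte offset, so it can "match" a needle
   that straddles the boundary between two characters. */
long find_bytewise(const unsigned char *hay, size_t hlen,
                   const unsigned char *needle, size_t nlen)
{
    assert(hay != NULL && needle != NULL && nlen > 0);
    for (size_t i = 0; i + nlen <= hlen; i++)
        if (memcmp(hay + i, needle, nlen) == 0)
            return (long)i;
    return -1;
}

/* Character-aware search: only tries offsets where a character starts.
   It does not check that the match also ends on a character boundary,
   which is sufficient for this illustration. */
long find_charwise(const unsigned char *hay, size_t hlen,
                   const unsigned char *needle, size_t nlen)
{
    assert(hay != NULL && needle != NULL && nlen > 0);
    for (size_t i = 0; i + nlen <= hlen;
         i += eucjp_char_len(hay + i, hlen - i))
        if (memcmp(hay + i, needle, nlen) == 0)
            return (long)i;
    return -1;
}
```

For the made-up haystack 'a' 'a' 'a' B0 A1 B1 A2 B2 A3 and the needle A2 B2 (last byte of the second multi-byte character, first byte of the third), find_bytewise() returns 6 while find_charwise() returns -1. A comparable straddling match is unlikely in UTF-8 because lead bytes and continuation bytes come from disjoint ranges.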
Changes since comment 3:
- Testcase uses test-skeleton.c.
- Uses SBC_MAX and includes "regex_internal.h".
- Set up the fastmap before the call to re_compile_pattern.
- bug-regex33.6 comment updated: There is one true and one false match.