[Bug locale/25744] New: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

[Bug locale/25744] New: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters

Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25744

            Bug ID: 25744
           Summary: mbrtowc with Big5-HKSCS returns 2 instead of 1 when
                    consuming the second byte of certain double byte
                    characters
           Product: glibc
           Version: 2.31
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: tom at honermann dot net
  Target Milestone: ---

The following test case demonstrates an issue with the Big5-HKSCS converter
that occurs when certain double byte characters are consumed one byte at a
time.

Quoting from the C18 standard for convenience: 7.29.6.3.2p4 states:

> The mbrtowc function returns the first of the following that applies
> (given the current conversion state):
>
> 0 if the next n or fewer bytes complete the multibyte character that
> corresponds to the null wide character (which is the value stored).
>
> between 1 and n inclusive if the next n or fewer bytes complete a valid
> multibyte character (which is the value stored); the value returned is
> the number of bytes that complete the multibyte character.
>
> (size_t) (−2) if the next n bytes contribute to an incomplete (but
> potentially valid) multibyte character, and all n bytes have been
> processed (no value is stored).
>
> (size_t)(-1) if an encoding error occurs, in which case the next n or
> fewer bytes do not contribute to a complete and valid multibyte
> character (no value is stored); the value of the macro EILSEQ is stored
> in errno, and the conversion state is unspecified.

The following test case converts the Big5-HKSCS double byte character 0x88 0x62
one byte at a time.  This is one of the special double byte characters that
converts to two Unicode code points (U+00CA U+0304).  When consuming the second
byte, mbrtowc() returns a value of 2, but 1 should be returned since only one
byte was consumed; the wording quoted above doesn't allow for the return value
to be larger than the value of 'n' that was passed in.

I don't know if this issue only occurs for double byte characters that map to
multiple Unicode code points.

$ cat t.c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main() {
  /* This test case demonstrates glibc's current (2.31) behavior when
     attempting to convert, one byte at a time, Big5-HKSCS input
     containing a double byte sequence that maps to multiple Unicode
     code points.

     The first call to mbrtowc() consumes the first byte and returns
     a value of -2 indicating an incomplete multibyte character as
     expected.  However, the second call, with an input length of 1,
     consumes the second byte, recognizes completion of the previously
     incomplete character, writes the mapped Unicode code point, and
     then returns 2.  The return value of 2 is surprising since only
     1 byte was read.  This seems to violate the C standard since the
     return value is greater than the input length.

     This issue does not occur for all double byte characters. */

  if (! setlocale(LC_ALL, "zh_HK.BIG5-HKSCS")) {
    perror("setlocale");
    return 1;
  }

  const char *mbs;
  wchar_t wc;
  mbstate_t s;
  size_t result;

  mbs = "\x88\x62";
  memset(&s, 0, sizeof(s));
  /* This call to mbrtowc() consumes the first byte and returns -2 indicating
     that a potentially valid but incomplete character was read.  This is
     expected behavior. */
  result = mbrtowc(&wc, mbs, 1, &s);
  printf("1st mbrtowc call:\n");
  printf("  result: %zd (-2 expected)\n", result);
  assert(result == (size_t) -2);
  mbs += 1;
  /* This call consumes the second byte to complete the double byte character
     and writes the Unicode code point.  The C standard requires a return
     value of 1 since only one byte was consumed by this call, but glibs
     returns 2 (presumably corresponding to the total number of bytes that
     contributed to the multibyte character. */
  result = mbrtowc(&wc, mbs, 1, &s);
  printf("2nd mbrtowc call:\n");
  printf("  result: %zd (1 expected, glibc returns 2)\n", result);
  printf("  wc: 0x%04X (0x00CA expected)\n", (unsigned)wc);
  mbs += 1;
  assert(result == (size_t) 2); /* Current glibc behavior */
  assert(wc == 0x00CA);
  /* This call writes the second Unicode code point, but does not consume
     any input.  0 is returned since no input is consumed.  According to
     the C standard, a return of 0 is reserved for when a null character is
     written, but since the C standard doesn't acknowledge the existence of
     characters that can't be represented in a single wchar_t, we're already
     operating outside the scope of the standard.  Returning 0 seems
     reasonable to me.  Returning -3 as mbrtoc16() does would also be
     reasonable. */
  result = mbrtowc(&wc, mbs, 1, &s);
  printf("3rd mbrtowc call:\n");
  printf("  result: %zd (0 expected)\n", result);
  printf("  wc: 0x%04X (0x0304 expected)\n", (unsigned)wc);
  mbs += 0;
  assert(result == (size_t) 0);
  assert(wc == 0x0304);
  /* Attempting to process any further input would run afoul of a separate
     issue tracked by https://sourceware.org/bugzilla/show_bug.cgi?id=25734. */
}

$ gcc t.c -o t

$ ./t
1st mbrtowc call:
  result: -2 (-2 expected)
2nd mbrtowc call:
  result: 2 (1 expected, glibc returns 2)
  wc: 0x00CA (0x00CA expected)
3rd mbrtowc call:
  result: 0 (0 expected)
  wc: 0x0304 (0x0304 expected)

As noted in the code comments and process output, the return value for the
second call to mbrtowc() is unexpected.

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters

Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25744

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
                 CC|                            |carlos at redhat dot com
   Last reconfirmed|                            |2020-03-29

--- Comment #1 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Tom Honermann from comment #0)
> 2nd mbrtowc call:
>   result: 2 (1 expected, glibc returns 2)
>   wc: 0x00CA (0x00CA expected)

> As noted in the code comments and process output, the return value for the
> second call to mbrtowc() is unexpected.

My opinion is that this is an implementation defect in the converter which has
absolute control over how those return values are computed.

My worry though is that changing this now might break existing code that is out
there that is using BIG5HKSCS.

If that's actually the case then we could create a distinct gconv module with
the old behaviour and release that.

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25744

--- Comment #2 from Andreas Schwab <[hidden email]> ---
I don't think you can call mbrtowc with a different value for mbs after the
first call returned -2, without resetting the state inbetween.

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25744

--- Comment #3 from Carlos O'Donell <carlos at redhat dot com> ---
https://sourceware.org/pipermail/libc-alpha/2020-March/112300.html

Options:

(a) Do not change the current converter. We return 2 consumed bytes in
    the first conversion, and 0 on the second.

(b) Look ahead and split the conversion. We return 1 consumed in the
    first conversion, and 1 on the second. This prevents us from returning
    0 which may be interpreted as a L'\0'. This leads to false assumption
    that the user could stop the conversion at this point and modify the
    input.

(c) Design something new. Return (size_t) -3 indicating a One-to-Many
    conversion is happening and that there is more output to be generated.

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25744

--- Comment #4 from Tom Honermann <tom at honermann dot net> ---
(In reply to Andreas Schwab from comment #2)
> I don't think you can call mbrtowc with a different value for mbs after the
> first call returned -2, without resetting the state inbetween.

The standard is not clear here, but I think it is required to increment 'mbs'
by 'n' in such cases.  Existing tests in glibc for other encodings demonstrate
this intent.  For examples, see the tests for UTF-8 in wcsmbs/tst-mbrtowc.c.

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25744

--- Comment #5 from Andreas Schwab <[hidden email]> ---
`n' is what?

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25744

--- Comment #6 from Tom Honermann <tom at honermann dot net> ---
(In reply to Andreas Schwab from comment #5)
> `n' is what?

The value for 'n' passed to mbrtowc() and that corresponds to the specification
for the -2 return value:

> (size_t) (−2) if the next n bytes contribute to an incomplete (but
> potentially valid) multibyte character, and all n bytes have been
> processed (no value is stored).

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25744

--- Comment #7 from Tom Honermann <tom at honermann dot net> ---
I debugged and identified the root cause for this issue.  The problem is that
the Big5-HKSCS converter is failing to preserve the lowest 3 bits of the
mbstate_t __count data member.

The relevant code is below.

In iconv/loop.c, lines 396-397 are responsible for unpacking previously cached
bytes stored in 'state' into an internal buffer (bytebuf).  The number of
previously cached bytes is stored in the three lowest bits of 'state->__count'.
 (The Big5-HKSCS converter does not define UNPACK_BYTES).

Lines 485-490 are responsible for the packing of processed bytes and the count
of them when insufficient bytes were available to process a complete character.
 (The Big5-HKSCS converter does not define STORE_REST).

The line most relevant to this issue is line 477 where the number of bytes to
advance the input pointer provided by the caller is determined.  The
calculation is performed by determining how much of the internal buffer
(bytebuf) was read by the converter, and then subtracting the number of bytes
that were populated from the cache (lines 396-397).

The Big5-HKSCS converter runs at line 446 in the expansion of BODY.  Note that
this occurs between the lines that unpack cached bytes (396-397) and the line
that computes how far to advance the input pointer (477).  (more after the code
break).

iconv/loop.c:
366 static inline int
367 __attribute ((always_inline))
368 SINGLE(LOOPFCT) (struct __gconv_step *step,
369                  struct __gconv_step_data *step_data,
370                  const unsigned char **inptrp, const unsigned char *inend,
371                  unsigned char **outptrp, unsigned char *outend,
372                  size_t *irreversible EXTRA_LOOP_DECLS)
373 {
...
391 #  ifdef UNPACK_BYTES
...
393 #  else
394   /* Add the bytes from the state to the input buffer.  */
395   assert ((state->__count & 7) <= sizeof (state->__value));
396   for (inlen = 0; inlen < (size_t) (state->__count & 7); ++inlen)
397     bytebuf[inlen] = state->__value.__wchb[inlen];
398 #  endif
...
441   inptr = bytebuf;
442   inend = &bytebuf[inlen];
...
446       BODY
...
477       *inptrp += inend - bytebuf - (state->__count & 7);
478 #  ifdef STORE_REST
...
482 #  else
483       /* We don't have enough input for another complete input
484          character.  */
485       assert (inend - inptr > (state->__count & ~7));
486       assert (inend - inptr <= sizeof (state->__value));
487       state->__count = (state->__count & ~7) | (inend - inptr);
488       inlen = 0;
489       while (inptr < inend)
490         state->__value.__wchb[inlen++] = *inptr++;
491 #  endif
492     }
493
494   return result;
495 }

The Big5-HKSCS converter also stores information in 'state->__count'.  It does
not use the lower most 3 bits for its own state (where the number of cached
characters is stored; it appears to acknowledge that they are reserved).  The
problem is that there are numerous places where the converter updates
'state->__count', but fails to preserve those bits (there are also cases where
it fails to properly ignore them).  When these bits are not preserved, line 477
above miscalulates the number of bytes consumed from the input.

Cases where those bits are not properly handled are shown below.

iconvdata/big5hkscs.c:
17773 #define EMIT_SHIFT_TO_INIT \
17774   if (data->__statep->__count != 0)                                      
    \
.....
17783               data->__statep->__count = 0;     .....
.....
17797               data->__statep->__count = 0;                              
    \
.....
17803     }
.....
17812 #define BODY \
17813   {                                                                      
    \
.....
17883                 *statep = ch2 << 3;                                      
    \
.....
17901   }
.....
17920 #define BODY \
17921   {                                                                      
    \
.....
17948         *statep = 0;                                                    
    \
.....
17961         *statep = 0;                                                    
    \
.....
17998                 *statep = ((cp[0] << 8) | cp[1]) << 3;                  
    \
.....
18019   }

A local update to preserve the lower 3 bits sufficed to resolve this issue.
I'm working on a patch.

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25744

--- Comment #8 from Tom Honermann <tom at honermann dot net> ---
A patch for this issue has been submitted to the libc-alpha mailing list.
- https://sourceware.org/pipermail/libc-alpha/2020-April/112595.html

--
You are receiving this mail because:
You are on the CC list for the bug.