[Bug locale/25734] New: mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

[Bug locale/25734] New: mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points

Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25734

            Bug ID: 25734
           Summary: mbrtowc with Big5-HKSCS fails to reset conversion
                    state for conversions that produce two Unicode code
                    points
           Product: glibc
           Version: 2.31
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: tom at honermann dot net
  Target Milestone: ---

mbrtowc() with Big5-HKSCS fails to reset conversion state for double byte
sequences (such as 0x88 0x62) that are converted to multiple Unicode code
points (U+00CA U+0304) after writing the second code point.  This results in
subsequent calls to mbrtowc() repeatedly writing the second code point without
consuming new input.

$ cat t.c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main() {
  /* Attempt to translate Big5-HKSCS input containing a double byte sequence
     (0x88 0x62) that maps to two Unicode code points (U+00CA U+0304) followed
     by an ASCII character.  This test demonstrates that the converter fails to
     reset conversion state after writing the second code point such that each
     subsequent call of mbrtowc() writes the second code point again. */

  if (! setlocale(LC_ALL, "zh_HK.BIG5-HKSCS")) {
    perror("setlocale");
    return 1;
  }

  const char *mbs;
  wchar_t wc;
  mbstate_t s;
  size_t result;

  mbs = "\x88\x62\x58";
  memset(&s, 0, sizeof(s));
  /* Translate the first code unit sequence.  This call to mbrtowc() consumes
     the first two bytes and writes the first Unicode code point. */
  result = mbrtowc(&wc, mbs, 3, &s);
  assert(result == (size_t) 2);
  mbs += 2;
  printf("1st wc: 0x%04X\n", (unsigned)wc);
  assert(wc == 0x00CA);
  /* This next call to mbrtowc() writes the second Unicode code point without
     consuming any input.  Since output was written, but no input was consumed,
     0 is returned.  This is a case where mbrtoc16() would return (size_t)-3,
     but mbrtowc() isn't specified to do so.  This behavior is a bit confusing
     because the return of 0 is specified to indicate that a null character
     was written; but that isn't the case here. */
  result = mbrtowc(&wc, mbs, 1, &s);
  assert(result == (size_t) 0);
  printf("2nd wc: 0x%04X\n", (unsigned)wc);
  assert(wc == 0x0304);
  /* This next call to mbrtowc() should now consume and write the ASCII
     character.  However, the failure of the converter to reset the conversion
     state results in the second Unicode code point being written again; again
     without consuming new input. */
  wc = 0; /* set wc to an arbitrary value. */
  result = mbrtowc(&wc, mbs, 1, &s);
  printf("3rd wc: 0x%04X (should be 0x0058)\n", (unsigned)wc);
  printf("result: %d (should be 1)\n", (int)result);
  /* These two assertions should succeed, but currently fail. */
  assert(wc == 0x0058);
  assert(result == 1);
}

$ gcc t.c -o t

$ ./t
1st wc: 0x00CA
2nd wc: 0x0304
3rd wc: 0x0304 (should be 0x0058)
result: 0 (should be 1)
t: t.c:52: main: Assertion `wc == 0x0058' failed.
Aborted (core dumped)


As noted in the test and by the failed assertion, the third call to mbrtowc()
should have consumed 1 new byte (0x58) and written one code point (U+0058), but
it instead consumed no input and re-wrote the previously written code point.

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug locale/25734] mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points

Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25734

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2020-03-27
             Status|UNCONFIRMED                 |NEW
                 CC|                            |carlos at redhat dot com,
                   |                            |[hidden email]
     Ever confirmed|0                           |1

--- Comment #1 from Carlos O'Donell <carlos at redhat dot com> ---
Andreas Schwab probably has the most experience with Big5-HKSCS. I'm CC'ing him
on this this issue in case he has any thoughts about this.

I can confirm that /x88/x62 should map to <U00CA><U0304>.

Double checked that against Big5-HKSCS 2016 (which we don't support yet):
https://www.ogcio.gov.hk/en/our_work/business/tech_promotion/ccli/terms/doc/e_hkscs_2016.pdf

I can confirm the test case behaves as you indicated.

In glibc/iconvdata/big5hkscs.c we certainly handle the combining characters:

17843                 /* Check for special cases: combining characters.  */    
    \
17844                 if (idx == 195 + 0x22 /* 8862 */)                        
    \
17845                   {                                                      
    \
17846                     ch = 0xca;                                          
    \
17847                     ch2 = 0x304;                                        
    \
17848                   }                                                      
    \
17849                 else if (idx == 195 + 0x24 /* 8864 */)                  
    \
17850                   {                                                      
    \
17851                     ch = 0xca;                                          
    \
17852                     ch2 = 0x30c;                                        
    \
17853                   }                                                      
    \
17854                 else if (idx == 195 + 0x63 /* 88a3 */)                  
    \
17855                   {                                                      
    \
17856                     ch = 0xea;                                          
    \
17857                     ch2 = 0x304;                                        
    \
17858                   }                                                      
    \
17859                 else if (idx == 195 + 0x65 /* 88a5 */)                  
    \
17860                   {                                                      
    \
17861                     ch = 0xea;                                          
    \
17862                     ch2 = 0x30c;                                        
    \
17863                   }                                                      
    \
17864                 else                                                    
    \
17865                   /* This is illegal.  */                                
    \
17866                   STANDARD_FROM_LOOP_ERR_HANDLER (1);                    
    \

I haven't debugged this in great detail though.

diff --git a/iconvdata/big5hkscs.c b/iconvdata/big5hkscs.c
index 01fcfeba76..70f84a5226 100644
--- a/iconvdata/big5hkscs.c
+++ b/iconvdata/big5hkscs.c
@@ -17898,6 +17898,7 @@ static struct
                                                                              \
     put32 (outptr, ch);                                                      
      \
     outptr += 4;                                                             \
+    *statep = 0;                                                             \
   }
 #define LOOP_NEED_FLAGS
 #define EXTRA_LOOP_DECLS       , int *statep

If we have a pushed back character we should clear statep.

I expect we need a test case to cover the case you outline.

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug locale/25734] mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25734

--- Comment #2 from Carlos O'Donell <carlos at redhat dot com> ---
The quick fix appears to work.

GCONV_PATH=/home/carlos/build/glibc/iconvdata ./test
1st wc: 0x00CA
2nd wc: 0x0304
3rd wc: 0x0058 (should be 0x0058)
result: 1 (should be 1)

I need to dig into this a bit more.

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug locale/25734] mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25734

--- Comment #3 from Tom Honermann <tom at honermann dot net> ---
Thanks for confirming, Carlos.

I had independently arrived at the same quick fix, but you beat me to posting
it.

I don't know glibc internals well.  Something I've been surprised by is that
iconv doesn't seem to suffer this same issue.  I haven't tried to debug to
determine why it is unaffected.

(In reply to Carlos O'Donell from comment #1)
> I expect we need a test case to cover the case you outline.

Agreed.  The patch below is what I had added locally as a test.  Feel free to
use this (I've already signed a glibc contributor form).

diff --git a/wcsmbs/Makefile b/wcsmbs/Makefile
index f02167fa58..2e58ac36e8 100644
--- a/wcsmbs/Makefile
+++ b/wcsmbs/Makefile
@@ -58,7 +58,8 @@ include ../Rules

 ifeq ($(run-built-tests),yes)
 LOCALES := de_DE.ISO-8859-1 de_DE.UTF-8 en_US.ANSI_X3.4-1968 hr_HR.ISO-8859-2
\
-          ja_JP.EUC-JP zh_TW.EUC-TW tr_TR.UTF-8 tr_TR.ISO-8859-9
+          ja_JP.EUC-JP zh_TW.EUC-TW tr_TR.UTF-8 tr_TR.ISO-8859-9 \
+          zh_HK.BIG5-HKSCS
 include ../gen-locales.mk

 $(objpfx)tst-btowc.out: $(gen-locales)
diff --git a/wcsmbs/tst-mbrtowc.c b/wcsmbs/tst-mbrtowc.c
index 38a431983b..b2b3ff3744 100644
--- a/wcsmbs/tst-mbrtowc.c
+++ b/wcsmbs/tst-mbrtowc.c
@@ -150,6 +150,43 @@ utf8_test (void)
   return error;
 }

+static int
+big5_hkscs_test (void)
+{
+  const char *locale = "zh_HK.BIG5-HKSCS";
+  const char *mbs;
+  wchar_t wc;
+  mbstate_t s;
+
+  if (!setlocale (LC_CTYPE, locale))
+    {
+      fprintf (stderr, "locale '%s' not available!\n", locale);
+      exit (1);
+    }
+
+  /* This test exercises a special case conversion of a double byte sequence
+     to a pair of Unicode code points.  */
+  wc = 42;                     /* arbitrary number */
+  mbs = "\x88\x62";            /* 0x88 0x62 => U+00CA U+0304 */
+  memset (&s, 0, sizeof(s));
+  /* Translate the double byte sequence.  This call consumes the first two
+     bytes and writes the first of the two code points.  */
+  assert (mbrtowc (&wc, mbs, 3, &s) == (size_t) 2);
+  assert (wc == 0x00CA);
+  /* Attempt to translate the next sequence.  mbrtowc isn't specified to
+     return (size_t) -3 like mbrtoc16 is in the case where there are multiple
+     wide chars to write.  This call returns 0 despite not writing a null
+     character.  0 is returned because no new input is consumed; the call
+     writes the second code point. */
+  assert (mbrtowc (&wc, mbs + 2, 1, &s) == (size_t) 0);
+  assert (wc == 0x0304);
+  /* The following now consumes and converts the null character. */
+  assert (mbrtowc (&wc, mbs + 2, 1, &s) == (size_t) 0);
+  assert(wc == 0);
+
+  return 0;
+}
+

 static int
 do_test (void)
@@ -169,6 +206,8 @@ do_test (void)
   setlocale (LC_ALL, "ja_JP.EUC-JP");
   result |= check_ascii (setlocale (LC_ALL, NULL));

+  result |= big5_hkscs_test ();
+
   return result;
 }

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug locale/25734] mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25734

--- Comment #4 from Carlos O'Donell <carlos at redhat dot com> ---
Tom,

I split the test out specifically for BIG5HKSCS and included all 4 conversions
that create 2 wchar_ts.

Posted: https://sourceware.org/pipermail/libc-alpha/2020-March/112257.html

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug locale/25734] mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25734

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
   Target Milestone|---                         |2.32
         Resolution|---                         |FIXED

--- Comment #5 from Carlos O'Donell <carlos at redhat dot com> ---
Fixed in 2.32

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug locale/25734] mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25734

Tom Honermann <tom at honermann dot net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |VERIFIED

--- Comment #6 from Tom Honermann <tom at honermann dot net> ---
Verified fixed for 2.32 as of commit c580e6466d6da8262820cdbad19f32c5546226cf
(https://sourceware.org/git/?p=glibc.git;a=commit;h=c580e6466d6da8262820cdbad19f32c5546226cf)

--
You are receiving this mail because:
You are on the CC list for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug locale/25734] mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points

Sourceware - glibc-bugs mailing list
In reply to this post by Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=25734

Florian Weimer <fw at deneb dot enyo.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fw at deneb dot enyo.de
              Flags|                            |security-
             Status|VERIFIED                    |RESOLVED

--- Comment #7 from Florian Weimer <fw at deneb dot enyo.de> ---
Please leave the status as RESOLVED. There is some automation that depends on
it.

--
You are receiving this mail because:
You are on the CC list for the bug.