[Bug locale/26034] New: mbrtowc(&x, "\xfa\xbd\x83\x96\x80", 5, NULL) return 5, instead of -1 with UTF-8 locale

Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=26034

            Bug ID: 26034
           Summary: mbrtowc(&x, "\xfa\xbd\x83\x96\x80", 5, NULL) return 5,
                    instead of -1 with UTF-8 locale
           Product: glibc
           Version: unspecified
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: johannes at sipsolutions dot net
  Target Milestone: ---

Created attachment 12566
  --> https://sourceware.org/bugzilla/attachment.cgi?id=12566&action=edit
simple test program

It seems that, according to RFC 3629, mbrtowc() should return (size_t)-1 here,
since the sequence encodes an invalid character: U+2F43580 lies outside the
range [U+0, U+10FFFF].
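
For reference, the value above can be reproduced by hand. The following is only
a minimal sketch, assuming the pre-RFC 3629 five-octet layout
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx; the helper names
decode_legacy_5byte() and in_unicode_range() are hypothetical and are not
glibc functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one sequence in the old five-octet form
   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx, with no range check.
   Returns the raw value, or UINT32_MAX if the bytes do not have that shape. */
static uint32_t decode_legacy_5byte(const unsigned char *s, size_t n)
{
    assert(s != NULL);
    if (n < 5 || (s[0] & 0xFC) != 0xF8)
        return UINT32_MAX;

    uint32_t cp = s[0] & 0x03;              /* 2 payload bits in the lead octet */
    for (size_t i = 1; i < 5; i++) {
        if ((s[i] & 0xC0) != 0x80)          /* must be a continuation octet */
            return UINT32_MAX;
        cp = (cp << 6) | (s[i] & 0x3F);     /* 6 payload bits per octet */
    }
    return cp;
}

/* RFC 3629 limits UTF-8 to the range U+0000..U+10FFFF. */
static int in_unicode_range(uint32_t cp)
{
    return cp <= 0x10FFFF;
}
```

For the bytes "\xfa\xbd\x83\x96\x80" this yields 0x2F43580, for which
in_unicode_range() returns 0.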

This came up because Python does this conversion using mbstowcs() and/or
mbrtowc(), but then later goes to check that valid characters were returned.
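
The kind of after-the-fact check described here could look roughly like the
sketch below. This is not Python's actual code; mbrtowc_checked() is a
hypothetical wrapper that only adds an upper-bound check on top of the standard
mbrtowc() call, and it assumes the caller passes a non-NULL conversion state.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical wrapper (not part of glibc or Python): call mbrtowc() and
   additionally reject results above U+10FFFF, reporting them as EILSEQ.
   Surrogates are not checked here; this is a range check only. */
static size_t mbrtowc_checked(wchar_t *pwc, const char *s, size_t n,
                              mbstate_t *ps)
{
    wchar_t wc = 0;
    assert(s != NULL && ps != NULL);

    size_t r = mbrtowc(&wc, s, n, ps);
    if (r == (size_t)-1 || r == (size_t)-2)
        return r;                       /* invalid or incomplete: pass through */

    if ((unsigned long)wc > 0x10FFFFul) {
        errno = EILSEQ;                 /* decoded, but outside the Unicode range */
        return (size_t)-1;
    }
    if (pwc != NULL)
        *pwc = wc;
    return r;
}
```

With a glibc that shows the behaviour reported here, the five-byte input would
come back from mbrtowc() as 5 and then be turned into (size_t)-1 by the wrapper.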

The Python discussion is here:

https://bugs.python.org/issue35883

but given the language in RFC 3629, it seems like an issue in glibc:


3.  UTF-8 definition

[...]

   In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
   accessible range) are encoded using sequences of 1 to 4 octets.

[...]

      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

[...]

   Implementations of the decoding algorithm above MUST protect against
   decoding invalid sequences.

[...]
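
To make the quoted requirement concrete, here is a minimal sketch of a decoder
that follows the table above and rejects everything else. It is an illustration
only, not glibc's implementation; utf8_decode_strict() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical strict decoder in the spirit of RFC 3629.
   Returns the number of octets consumed (1..4) and stores the code point
   in *out, or returns 0 if the input is invalid or truncated. */
static size_t utf8_decode_strict(const unsigned char *s, size_t n,
                                 uint32_t *out)
{
    assert(s != NULL && out != NULL);
    if (n == 0)
        return 0;

    size_t len;
    uint32_t cp, min;   /* min: smallest value allowed for this length */
    if (s[0] < 0x80)                { len = 1; cp = s[0];        min = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else
        return 0;       /* stray continuation octet, or a 5/6-octet lead (0xFA) */

    if (n < len)
        return 0;       /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;   /* not a continuation octet */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (cp < min)
        return 0;       /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;       /* surrogates are not valid in UTF-8 */
    if (cp > 0x10FFFF)
        return 0;       /* beyond the Unicode range */

    *out = cp;
    return len;
}
```

With this sketch, the input from this report is rejected at the lead octet,
because 0xFA does not match any of the four patterns in the table.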

--
You are receiving this mail because:
You are on the CC list for the bug.

--- Comment #1 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
Probably the same issue as bug 2373.

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://sourceware.org/bugz
                   |                            |illa/show_bug.cgi?id=2373

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |security-
                 CC|                            |fweimer at redhat dot com

--- Comment #2 from Johannes Berg <johannes at sipsolutions dot net> ---
Hm, yeah, that sounds the same; I had only searched for the specific
function(s), not the broader issue. I guess I won't hold my breath for this to
get fixed then ...

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|https://sourceware.org/bugz |
                   |illa/show_bug.cgi?id=2373   |
         Resolution|---                         |DUPLICATE
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #3 from Florian Weimer <fweimer at redhat dot com> ---
Marking as duplicate per comment 2.

*** This bug has been marked as a duplicate of bug 2373 ***

--- Comment #4 from Johannes Berg <johannes at sipsolutions dot net> ---
Hm, one note, though. I was just mentioning this to somebody: bug 2373 talks
about *encoding*, while this one is mostly about *decoding*. So it's related,
but not exactly the same. Up to you whether or not you want to treat it as a
duplicate, but it's two sides of the same coin. An argument could be made, for
example, for allowing *encoding* it (why did the application store something
greater than 0x10ffff in a wchar_t to begin with? That was already invalid)
but not *decoding* it, even if that breaks the round-trip property.

--- Comment #5 from Florian Weimer <fweimer at redhat dot com> ---
Fair point. I have retitled bug 2373 accordingly.
