[Bug locale/2373] Restrict UTF-8 to 17 planes, as required by RFC 3629

Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=2373

--- Comment #8 from Johannes Berg <johannes at sipsolutions dot net> ---
Oh, ok. The original comment here seemed to imply that ISO was the last one to
hold out for more space than the others.


To carry over some discussion from the bug I originally filed (which has since
been closed as a duplicate of this one):

This came up because Python does this conversion using mbstowcs() and/or
mbrtowc(), but then goes on to check that the characters returned are valid.
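
As a rough illustration of that "convert, then validate" pattern (a minimal
sketch, not Python's actual code): the helper names is_scalar_value() and
decode_one_checked() are made up for this example, and it assumes a wchar_t
that holds the decoded code point directly, as it does with glibc.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: true if wc is a Unicode scalar value, i.e. at most
   U+10FFFF and not in the surrogate range U+D800..U+DFFF. */
static bool is_scalar_value(wchar_t wc)
{
    /* A negative wchar_t becomes a huge value here and is rejected. */
    unsigned long cp = (unsigned long) wc;
    return cp <= 0x10FFFFul && !(cp >= 0xD800ul && cp <= 0xDFFFul);
}

/* Hypothetical helper: decode one character with mbrtowc(), then apply the
   extra check the caller currently has to do itself. Returns the number of
   bytes consumed, or (size_t)-1 if decoding failed, the input was
   incomplete, or the result is not a valid scalar value. */
static size_t decode_one_checked(const char *s, size_t n,
                                 wchar_t *out, mbstate_t *st)
{
    size_t r = mbrtowc(out, s, n, st);
    if (r == (size_t) -1 || r == (size_t) -2)
        return (size_t) -1;
    if (!is_scalar_value(*out))
        return (size_t) -1;
    return r;
}
```

If mbrtowc() itself rejected out-of-range values and surrogates in UTF-8
locales, the second check would be redundant.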

The python discussion is here:

https://bugs.python.org/issue35883


Note that this isn't just about the range; the RFC also prohibits encoding
the code points reserved for surrogate pairs:


RFC 3629:

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.
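
At the byte level, both restrictions (no surrogates, nothing above U+10FFFF)
can be detected from the first two bytes of a sequence. The following is only
a sketch with a made-up function name, not glibc's converter code, and it
deliberately ignores the other validity rules (overlong forms, truncated
sequences, bad continuation bytes).

```c
#include <stdbool.h>

/* Hypothetical helper: given the lead byte b0 and the first continuation
   byte b1 of a UTF-8 sequence, report whether the sequence would encode a
   value that RFC 3629 forbids. */
static bool utf8_prefix_forbidden(unsigned char b0, unsigned char b1)
{
    /* U+D800..U+DFFF encode as ED A0 80 .. ED BF BF. */
    if (b0 == 0xED && b1 >= 0xA0 && b1 <= 0xBF)
        return true;
    /* U+10FFFF is F4 8F BF BF, so F4 90.. is already beyond plane 16. */
    if (b0 == 0xF4 && b1 >= 0x90 && b1 <= 0xBF)
        return true;
    /* Lead bytes F5..FF can only start values above U+10FFFF. */
    if (b0 >= 0xF5)
        return true;
    return false;
}
```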


(Python may actually allow this range internally, in a UTF-8-like encoded
string [that they call utf-8b], to carry arbitrary bytes around.)
