[Bug locale/2373] Restrict UTF-8 to 17 planes, as required by RFC 3629


Sourceware - glibc-bugs mailing list

--- Comment #8 from Johannes Berg <johannes at sipsolutions dot net> ---
Oh, ok. The original comment here seemed to imply that ISO was the last one to
hold out for more space than the others.

To carry over some discussion from the bug I originally filed (which has since
been closed as a duplicate of this one):

This came up because Python does this conversion using mbstowcs() and/or
mbrtowc(), but then later goes to check that valid characters were returned.
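As a rough illustration of that pattern (not Python's actual code), a caller
can convert with mbrtowc() and then reject anything that is not a Unicode
scalar value. The helper names wc_is_scalar_value() and decode_checked() are
hypothetical, and the sketch assumes a 32-bit wchar_t holding code points, as
on glibc:

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: 1 if wc is a Unicode scalar value, i.e. at most
   U+10FFFF and not in the surrogate range U+D800..U+DFFF.
   Assumes wchar_t is 32 bits wide and holds code points. */
static int wc_is_scalar_value(wchar_t wc)
{
    unsigned long cp = (unsigned long)wc;

    if (cp > 0x10FFFFUL)
        return 0;
    if (cp >= 0xD800UL && cp <= 0xDFFFUL)
        return 0;
    return 1;
}

/* Hypothetical helper: convert src in the current locale's encoding,
   storing at most dstlen wide characters in dst. Returns the number of
   wide characters stored, or -1 if mbrtowc() fails, a converted value
   is not a scalar value, or dst is too small. */
static long decode_checked(const char *src, wchar_t *dst, size_t dstlen)
{
    mbstate_t st;
    size_t left = strlen(src);
    size_t out = 0;

    memset(&st, 0, sizeof st);
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, src, left, &st);

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;          /* invalid or incomplete sequence */
        if (n == 0)
            break;              /* decoded a null wide character */
        if (!wc_is_scalar_value(wc) || out >= dstlen)
            return -1;          /* library accepted it, caller does not */
        dst[out++] = wc;
        src += n;
        left -= n;
    }
    return (long)out;
}
```

The second check is exactly the extra work a caller would not need if the
conversion functions themselves refused such sequences.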

The python discussion is here:


Note that this isn't just about the range: the RFC also prohibits encoding the
code points reserved for surrogate pairs:

RFC 3629:

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.

(Python internally may actually allow using these in a UTF-8-like encoded
string [that they call utf-8b] to carry arbitrary bytes around.)
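At the byte level, both restrictions can be spotted from the first two bytes
of a sequence. A minimal sketch, with a hypothetical function name, assuming
b1 is already known to be a continuation byte (0x80..0xBF); it does not cover
overlong forms or any other validity checks:

```c
/* Hypothetical helper: 1 if a multibyte sequence starting with b0 b1
   would encode a surrogate or a value above U+10FFFF, else 0.
   Assumes b1 is a continuation byte (0x80..0xBF). */
static int rfc3629_rejects_prefix(unsigned char b0, unsigned char b1)
{
    if (b0 >= 0xF5)
        return 1;   /* lead bytes F5..FF only reach beyond U+10FFFF */
    if (b0 == 0xF4 && b1 >= 0x90)
        return 1;   /* F4 90 80 80 is U+110000, the first value past plane 16 */
    if (b0 == 0xED && b1 >= 0xA0)
        return 1;   /* ED A0 80 .. ED BF BF is U+D800..U+DFFF */
    return 0;
}
```

For example, ED 9F BF (U+D7FF) and F4 8F BF BF (U+10FFFF) pass, while
ED A0 80 (U+D800) and F4 90 80 80 (U+110000) are rejected.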
