[Bug libc/2373] New: iconv allows encoding characters above U+10FFFF in UTF-8

glaubitz at physik dot fu-berlin.de
Two of the three standards that define "UTF-8", the Unicode standard
and the RFC (RFC 3629), restrict it to encoding characters whose code
points are from 0 to 0x10FFFF.  Only the ISO standard (ISO/IEC 10646)
allows larger code points.  The Unicode Consortium has pledged never
to assign a character to a code point above 0x10FFFF.  People have
predicted that the ISO will soon come into agreement on this point,
although that seems not to have happened yet.

The experts seem to be in agreement that for security reasons it is a
good idea to impose the restriction to a maximum code point of
0x10FFFF when encoding/decoding UTF-8.

The iconv program does not impose this restriction.  It is pretty
clear from reading the code in iconv/gconv_simple.c that it allows up
to six bytes for the UTF-8 encoding and does not check whether the
value is above 0x10FFFF.  I have encountered this in practice using
iconv to filter questionable data where iconv has let through illegal
code points.  For example, I have right now a file resulting from
"iconv -c -f UTF-8 -t UTF-8" which exhibits the illegal character
U+176DF8.
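
To make the U+176DF8 example concrete, here is a minimal Python sketch of
the check being requested.  It is not the glibc code: the names
decode_legacy_sequence and is_allowed are made up for illustration, and the
decoder covers only the original 1-6 byte layout without rejecting
overlong forms or surrogates.  The four bytes F5 B6 B7 B8 are what that
layout yields for 0x176DF8.

```python
MAX_CODE_POINT = 0x10FFFF  # upper limit in the Unicode standard and RFC 3629


def decode_legacy_sequence(seq: bytes) -> int:
    """Decode one sequence of the original 1-6 byte UTF-8 layout.

    Illustration only (hypothetical helper): overlong forms and
    surrogates are not rejected here.
    """
    first = seq[0]
    if first < 0x80:
        length, value = 1, first
    elif 0xC0 <= first < 0xE0:
        length, value = 2, first & 0x1F
    elif 0xE0 <= first < 0xF0:
        length, value = 3, first & 0x0F
    elif 0xF0 <= first < 0xF8:
        length, value = 4, first & 0x07
    elif 0xF8 <= first < 0xFC:
        length, value = 5, first & 0x03
    elif 0xFC <= first < 0xFE:
        length, value = 6, first & 0x01
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("wrong sequence length")
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        # Each continuation byte contributes its low six bits.
        value = (value << 6) | (b & 0x3F)
    return value


def is_allowed(code_point: int) -> bool:
    """True if the code point is within the Unicode / RFC 3629 range."""
    return 0 <= code_point <= MAX_CODE_POINT


sample = bytes([0xF5, 0xB6, 0xB7, 0xB8])
cp = decode_legacy_sequence(sample)
print(hex(cp), is_allowed(cp))
```

A converter applying the restriction would treat such a sequence as
invalid input instead of passing it through.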

Can you please make iconv impose this restriction for UTF-8?

If needed, it might be okay to make "UTF-8" mean UTF-8 according to
Unicode and the RFC and allow another name (perhaps "UTF-8(ISO)"?)  to
mean the current unrestricted version, just in case someone
desperately needs the current UTF-8 support and (for some bizarre
reason!) its ability to encode values above 0x10FFFF.

--
           Summary: iconv allows encoding characters above U+10FFFF in UTF-8
           Product: glibc
           Version: 2.3.6
            Status: NEW
          Severity: normal
          Priority: P2
         Component: libc
        AssignedTo: drepper at redhat dot com
        ReportedBy: jbwells at blueyonder dot co dot uk
                CC: glibc-bugs at sources dot redhat dot com


http://sourceware.org/bugzilla/show_bug.cgi?id=2373

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

[Bug libc/2373] iconv allows encoding characters above U+10FFFF in UTF-8

glaubitz at physik dot fu-berlin.de

------- Additional Comments From drepper at redhat dot com  2006-04-26 06:31 -------
I don't agree at all.  There is no reason to possibly break someone's code.
Nobody has ever shown any evidence why this is a bad idea.

--
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WONTFIX


http://sourceware.org/bugzilla/show_bug.cgi?id=2373