Bug #: 14038
Summary: strcoll sorting order
AssignedTo: [hidden email]
ReportedBy: [hidden email]
CC: [hidden email]
Classification: Unclassified
(not sure if that's an implementation or documentation issue)
In UTF-8 locales, the result of some string comparisons depends on the length
of the strings. I'm not sure whether that is intended (if so, the docs should
reference the standard that defines these rules) or whether it is just a bug.
For example, if strcoll is used as a comparison function, these strings will be
sorted as follows:
--- Comment #1 from Andrzej <ndrwrdck at gmail dot com> 2012-05-01 03:58:14 UTC ---
Just to clarify, I ran into this issue(?) when we tried to optimize sorting in
Our assumption was that, if the first characters of two strings differ,
comparing just those characters is as good as comparing the whole strings;
that is, if 'あ' < 'a' then 'あaaa' < 'aa'. This assumption fails with
the current design of strcoll.
--- Comment #3 from Andreas Schwab <[hidden email]> 2012-05-01 06:54:12 UTC ---
The common sorting weights from iso14651_t1_common have no entries for Japanese
characters, so they are ignored in the first pass. The ja_JP locale sorts them
after the Latin characters.
What    |Removed    |Added
CC      |           |pasky at ucw dot cz
--- Comment #4 from Petr Baudis <pasky at ucw dot cz> 2012-05-01 08:58:15 UTC ---
Marking as INVALID, thanks to Andreas for taking care to explain. Indeed, the
sorting is locale-dependent and may ignore various (usually the unknown)
characters. Set LC_COLLATE to POSIX if you want "programmer-friendly" sorting
order. Andrzej, feel free to reopen if you have more questions.
--- Comment #5 from Andrzej <ndrwrdck at gmail dot com> 2012-05-01 10:44:21 UTC ---
Just wanted to ask whether there is any plan to add Japanese definitions to
the iso14651_t1_common file. The current behavior doesn't seem particularly
Also, the documentation issue is still valid: for a nontrivial function like
this, there should be at least some hints about where to find the comparison
rules or which standards it complies with.
(I'm satisfied with your explanation, so I won't reopen the bug. Please feel
free to reopen/reassign it if you think the above issues need to be addressed.)