ping [PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

ping [PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]

Diego (Egor) Kobylkin
Changelog v12:
* Adjusted to the new comment style suddenly appearing in the target
file locale/C-translit.h.in (the original file changed on the master
branch from /* style to # style since v11)
* Fixed a typo for <U04BB> CYRILLIC SMALL LETTER SHHA to be mapped to
"sh`" instead of erroneous "SH`" in v11

Changelog v11:
* Re-targeted the patch against locale/C-translit.h.in as the proper
file for the ASCII translit table.
* Correspondingly the patch now only contains the additional
Cyrillic-ASCII strings in the format of locale/C-translit.h.in table.
The 'include "translit_cyrillic";""' directives are not necessary in the
locale files and they are now all left intact.
* Also the file translit_cyrillic is not longer needed and is omitted.
* Edited below email, commit message.

Changelog v10:
* Removed ISO 9.1995 GOST 7.79-2000 System A (transliteration to Latin
with diacritics) as conflicting with System B within glibc mechanics and
not solving BZ #2872
* Edited below email, commit message, comment in translit_cyrillic to
reflect System A removal
* Removed <U0423><U0301> and <U0443><U0301> (Cyrillic U with acute,
using composition) as composing is not covered by current glibc
conversion mechanics

Changelog v9:
* Fixed formatting (trailing spaces etc.)
* Put commit summary in the patch file, now it is generated completely
by git format-patch

Changelog v8:
* Re-added missing translit_cyrillic in patch v7 (due to missing "git
add" in the script).

Changelog v7:
* Generated against git://sourceware.org/git/glibc.git master with git
format-patch.
* The 'include "translit_cyrillic";""' now immediately follows last
'include "translit_XXX";""' string (was inserted just before
translit_end previously.)
* Only the locales already having 'include .*translit.*;""' are patched
(see the list for manual exclusions below, full list of included locales
at the end of the email in the commit section.)
* Excluded az_AZ completely to avoid circular reference from tr_TR via
“copy "tr_TR"”.

Changelog v6:
* Locales removed from the patch: C and sd_PK.
* Added locales: az_AZ and ky_KG.
* Consistently transliterate single uppercase Cyrillic letters
    to sequences of all uppercase Latin letters in all languages (whenever
    a Cyrillic letter is transliterated to more than one Latin letter),
    for example "Ї" is now transliterated as "YI" rather than "Yi".

Dear locale maintainers,

fix the glibc bug 2872 "Transliteration Cyrillic -> ASCII fails"

https://sourceware.org/bugzilla/show_bug.cgi?id=2872 [1]

add the Cyrillic transliteration rows to locale/C-translit.h.in.

The patch is attached.


Current bug effect:

The glibc wiki explicitly lists this use case as the test example and
currently it fails on Cyrillic texts [1] [8] [9]:

iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt |grep CYRILLIC

CYRILLIC ????? ??? ???? ?????? ??????????? ?????, ?? ????? ?? ???.

- it produces a string of question marks and spaces.

This is what it should produce and it does so after the patch applied:

CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe
chayu.


The root problem and the fix:

The root problem is the missing transliteration table that I am
supplying here.


COMMIT MESSAGE:
This translit_cyrillic table enables conversion (e.g. with iconv) from a
UTF-8 encoded text based on Cyrillic alphabet to a ASCII//TRANSLIT text.

Example: iconv -f UTF-8 -t ASCII//TRANSLIT will produce ASCII
compatible transcription.

While a UTF-encoded Cyrillic text requires Cyrillic fonts the result of
a transliteration/transcription has only Latin/ASCII codes but still can
be read by a native speaker. Among other things it is useful for
processing the Cyrillic texts and filenames by programs or on systems
that are not specifically prepared to work with Cyrillic, don't have
corresponding fonts installed or can't handle UTF-8.

The patch content (mapping) is based on ISO 9.1995 standard [10] and its
derivative GOST 7.79-2000 System B official source (Federal Agency on
Technical Regulating and Metrology Of Russian Federation [2]).
Technically an independent but mostly identical source [3] was used and
prepared in a spreadsheet [6].

The transliteration of Cyrillic to ASCII according to GOST 7.79-2000
System B represents what is actually called transcription (preserving
phonemes), while System A is the transliteration (preserving graphemes).
There is no meaningful way to preserve graphemes converting Cyrillic to
ASCII and thus the System B is chosen [11]. To be super clear the System
A has nothing to do with this bug regardless it being a transliteration.

Those interested in implementing System A for transliteration of
Cyrillic to Latin with Diacritic as a new feature are welcome to use the
spreadsheet in [6] as a starting point.

Links:

[1] This bug entry https://sourceware.org/bugzilla/show_bug.cgi?id=2872
[2] GOST 7.79-2000 official source
http://protect.gost.ru/document.aspx?control=7&id=130715 (is only
available in low quality gif format)
[3] http://transliteration.ru/gost-7-79-2000/ and
http://www.yfermer.ru/specifications/285821.html
[4] Wikipedia article on Cyrillic transliteration with Latin alphabet
https://ru.wikipedia.org/wiki/%D0%A2%D1%80%D0%B0%D0%BD%D1%81%D0%BB%D0%B8%D1%82%D0%B5%D1%80%D0%B0%D1%86%D0%B8%D1%8F_%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D0%B0%D0%BB%D1%84%D0%B0%D0%B2%D0%B8%D1%82%D0%B0_%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B8%D1%86%D0%B5%D0%B9
[5] http://man7.org/linux/man-pages/man5/locale.5.html
[6] Spreadsheet for generating translit_cyrillic
https://sourceware.org/bugzilla/attachment.cgi?bugid=2872&action=viewall&hide_obsolete=1
[8] https://sourceware.org/glibc/wiki/Locales#Testing_Locales
[9] translit-test-input.txt
https://sourceware.org/bugzilla/attachment.cgi?id=11304
[10] https://en.wikipedia.org/wiki/ISO_9#GOST_7.79_System_B
[11]
https://scriptsource.org/cms/scripts/page.php?item_id=entry_detail&uid=gslmka8xq3

Best regards,
Egor Kobylkin





0001-Locales-Cyrillic-ASCII-transliteration-table-BZ-2872.patch (10K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

[PING^4][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]

Marko Myllynen
Ping?

On 19/03/2019 12.39, Egor Kobylkin wrote:

> Changelog v12:
> * Adjusted to the new comment style suddenly appearing in the target
> file locale/C-translit.h.in (the original file changed on the master
> branch from /* style to # style since v11)
> * Fixed a typo for <U04BB> CYRILLIC SMALL LETTER SHHA to be mapped to
> "sh`" instead of erroneous "SH`" in v11
>
> Changelog v11:
> * Re-targeted the patch against locale/C-translit.h.in as the proper
> file for the ASCII translit table.
> * Correspondingly the patch now only contains the additional
> Cyrillic-ASCII strings in the format of locale/C-translit.h.in table.
> The 'include "translit_cyrillic";""' directives are not necessary in the
> locale files and they are now all left intact.
> * Also the file translit_cyrillic is not longer needed and is omitted.
> * Edited below email, commit message.
>
> Changelog v10:
> * Removed ISO 9.1995 GOST 7.79-2000 System A (transliteration to Latin
> with diacritics) as conflicting with System B within glibc mechanics and
> not solving BZ #2872
> * Edited below email, commit message, comment in translit_cyrillic to
> reflect System A removal
> * Removed <U0423><U0301> and <U0443><U0301> (Cyrillic U with acute,
> using composition) as composing is not covered by current glibc
> conversion mechanics
>
> Changelog v9:
> * Fixed formatting (trailing spaces etc.)
> * Put commit summary in the patch file, now it is generated completely
> by git format-patch
>
> Changelog v8:
> * Re-added missing translit_cyrillic in patch v7 (due to missing "git
> add" in the script).
>
> Changelog v7:
> * Generated against git://sourceware.org/git/glibc.git master with git
> format-patch.
> * The 'include "translit_cyrillic";""' now immediately follows last
> 'include "translit_XXX";""' string (was inserted just before
> translit_end previously.)
> * Only the locales already having 'include .*translit.*;""' are patched
> (see the list for manual exclusions below, full list of included locales
> at the end of the email in the commit section.)
> * Excluded az_AZ completely to avoid circular reference from tr_TR via
> “copy "tr_TR"”.
>
> Changelog v6:
> * Locales removed from the patch: C and sd_PK.
> * Added locales: az_AZ and ky_KG.
> * Consistently transliterate single uppercase Cyrillic letters
>    to sequences of all uppercase Latin letters in all languages (whenever
>    a Cyrillic letter is transliterated to more than one Latin letter),
>    for example "Ї" is now transliterated as "YI" rather than "Yi".
>
> Dear locale maintainers,
>
> fix the glibc bug 2872 "Transliteration Cyrillic -> ASCII fails"
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=2872 [1]
>
> add the Cyrillic transliteration rows to locale/C-translit.h.in.
>
> The patch is attached.
>
>
> Current bug effect:
>
> The glibc wiki explicitly lists this use case as the test example and
> currently it fails on Cyrillic texts [1] [8] [9]:
>
> iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt |grep CYRILLIC
>
> CYRILLIC ????? ??? ???? ?????? ??????????? ?????, ?? ????? ?? ???.
>
> - it produces a string of question marks and spaces.
>
> This is what it should produce and it does so after the patch applied:
>
> CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe
> chayu.
>
>
> The root problem and the fix:
>
> The root problem is the missing transliteration table that I am
> supplying here.
>
>
> COMMIT MESSAGE:
> This translit_cyrillic table enables conversion (e.g. with iconv) from a
> UTF-8 encoded text based on Cyrillic alphabet to a ASCII//TRANSLIT text.
>
> Example: iconv -f UTF-8 -t ASCII//TRANSLIT will produce ASCII
> compatible transcription.
>
> While a UTF-encoded Cyrillic text requires Cyrillic fonts the result of
> a transliteration/transcription has only Latin/ASCII codes but still can
> be read by a native speaker. Among other things it is useful for
> processing the Cyrillic texts and filenames by programs or on systems
> that are not specifically prepared to work with Cyrillic, don't have
> corresponding fonts installed or can't handle UTF-8.
>
> The patch content (mapping) is based on ISO 9.1995 standard [10] and its
> derivative GOST 7.79-2000 System B official source (Federal Agency on
> Technical Regulating and Metrology Of Russian Federation [2]).
> Technically an independent but mostly identical source [3] was used and
> prepared in a spreadsheet [6].
>
> The transliteration of Cyrillic to ASCII according to GOST 7.79-2000
> System B represents what is actually called transcription (preserving
> phonemes), while System A is the transliteration (preserving graphemes).
> There is no meaningful way to preserve graphemes converting Cyrillic to
> ASCII and thus the System B is chosen [11]. To be super clear the System
> A has nothing to do with this bug regardless it being a transliteration.
>
> Those interested in implementing System A for transliteration of
> Cyrillic to Latin with Diacritic as a new feature are welcome to use the
> spreadsheet in [6] as a starting point.
>
> Links:
>
> [1] This bug entry https://sourceware.org/bugzilla/show_bug.cgi?id=2872
> [2] GOST 7.79-2000 official source
> http://protect.gost.ru/document.aspx?control=7&id=130715 (is only
> available in low quality gif format)
> [3] http://transliteration.ru/gost-7-79-2000/ and
> http://www.yfermer.ru/specifications/285821.html
> [4] Wikipedia article on Cyrillic transliteration with Latin alphabet
> https://ru.wikipedia.org/wiki/%D0%A2%D1%80%D0%B0%D0%BD%D1%81%D0%BB%D0%B8%D1%82%D0%B5%D1%80%D0%B0%D1%86%D0%B8%D1%8F_%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D0%B0%D0%BB%D1%84%D0%B0%D0%B2%D0%B8%D1%82%D0%B0_%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B8%D1%86%D0%B5%D0%B9
>
> [5] http://man7.org/linux/man-pages/man5/locale.5.html
> [6] Spreadsheet for generating translit_cyrillic
> https://sourceware.org/bugzilla/attachment.cgi?bugid=2872&action=viewall&hide_obsolete=1
>
> [8] https://sourceware.org/glibc/wiki/Locales#Testing_Locales
> [9] translit-test-input.txt
> https://sourceware.org/bugzilla/attachment.cgi?id=11304
> [10] https://en.wikipedia.org/wiki/ISO_9#GOST_7.79_System_B
> [11]
> https://scriptsource.org/cms/scripts/page.php?item_id=entry_detail&uid=gslmka8xq3
>
>
> Best regards,
> Egor Kobylkin
>
>
>
>


--
Marko Myllynen
Reply | Threaded
Open this post in threaded view
|

[PING^5][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]

Diego (Egor) Kobylkin
In reply to this post by Diego (Egor) Kobylkin
Ping?

On 19/03/2019 12.39, Egor Kobylkin wrote:

> Changelog v12:
> * Adjusted to the new comment style suddenly appearing in the target
> file locale/C-translit.h.in (the original file changed on the master
> branch from /* style to # style since v11)
> * Fixed a typo for <U04BB> CYRILLIC SMALL LETTER SHHA to be mapped to
> "sh`" instead of erroneous "SH`" in v11
>
> Changelog v11:
> * Re-targeted the patch against locale/C-translit.h.in as the proper
> file for the ASCII translit table.
> * Correspondingly the patch now only contains the additional
> Cyrillic-ASCII strings in the format of locale/C-translit.h.in table.
> The 'include "translit_cyrillic";""' directives are not necessary in the
> locale files and they are now all left intact.
> * Also the file translit_cyrillic is not longer needed and is omitted.
> * Edited below email, commit message.
>
> Changelog v10:
> * Removed ISO 9.1995 GOST 7.79-2000 System A (transliteration to Latin
> with diacritics) as conflicting with System B within glibc mechanics and
> not solving BZ #2872
> * Edited below email, commit message, comment in translit_cyrillic to
> reflect System A removal
> * Removed <U0423><U0301> and <U0443><U0301> (Cyrillic U with acute,
> using composition) as composing is not covered by current glibc
> conversion mechanics
>
> Changelog v9:
> * Fixed formatting (trailing spaces etc.)
> * Put commit summary in the patch file, now it is generated completely
> by git format-patch
>
> Changelog v8:
> * Re-added missing translit_cyrillic in patch v7 (due to missing "git
> add" in the script).
>
> Changelog v7:
> * Generated against git://sourceware.org/git/glibc.git master with git
> format-patch.
> * The 'include "translit_cyrillic";""' now immediately follows last
> 'include "translit_XXX";""' string (was inserted just before
> translit_end previously.)
> * Only the locales already having 'include .*translit.*;""' are patched
> (see the list for manual exclusions below, full list of included locales
> at the end of the email in the commit section.)
> * Excluded az_AZ completely to avoid circular reference from tr_TR via
> “copy "tr_TR"”.
>
> Changelog v6:
> * Locales removed from the patch: C and sd_PK.
> * Added locales: az_AZ and ky_KG.
> * Consistently transliterate single uppercase Cyrillic letters
>    to sequences of all uppercase Latin letters in all languages (whenever
>    a Cyrillic letter is transliterated to more than one Latin letter),
>    for example "Ї" is now transliterated as "YI" rather than "Yi".
>
> Dear locale maintainers,
>
> fix the glibc bug 2872 "Transliteration Cyrillic -> ASCII fails"
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=2872 [1]
>
> add the Cyrillic transliteration rows to locale/C-translit.h.in.
>
> The patch is attached.
>
>
> Current bug effect:
>
> The glibc wiki explicitly lists this use case as the test example and
> currently it fails on Cyrillic texts [1] [8] [9]:
>
> iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt |grep CYRILLIC
>
> CYRILLIC ????? ??? ???? ?????? ??????????? ?????, ?? ????? ?? ???.
>
> - it produces a string of question marks and spaces.
>
> This is what it should produce and it does so after the patch applied:
>
> CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe
> chayu.
>
>
> The root problem and the fix:
>
> The root problem is the missing transliteration table that I am
> supplying here.
>
>
> COMMIT MESSAGE:
> This translit_cyrillic table enables conversion (e.g. with iconv) from a
> UTF-8 encoded text based on Cyrillic alphabet to a ASCII//TRANSLIT text.
>
> Example: iconv -f UTF-8 -t ASCII//TRANSLIT will produce ASCII
> compatible transcription.
>
> While a UTF-encoded Cyrillic text requires Cyrillic fonts the result of
> a transliteration/transcription has only Latin/ASCII codes but still can
> be read by a native speaker. Among other things it is useful for
> processing the Cyrillic texts and filenames by programs or on systems
> that are not specifically prepared to work with Cyrillic, don't have
> corresponding fonts installed or can't handle UTF-8.
>
> The patch content (mapping) is based on ISO 9.1995 standard [10] and its
> derivative GOST 7.79-2000 System B official source (Federal Agency on
> Technical Regulating and Metrology Of Russian Federation [2]).
> Technically an independent but mostly identical source [3] was used and
> prepared in a spreadsheet [6].
>
> The transliteration of Cyrillic to ASCII according to GOST 7.79-2000
> System B represents what is actually called transcription (preserving
> phonemes), while System A is the transliteration (preserving graphemes).
> There is no meaningful way to preserve graphemes converting Cyrillic to
> ASCII and thus the System B is chosen [11]. To be super clear the System
> A has nothing to do with this bug regardless it being a transliteration.
>
> Those interested in implementing System A for transliteration of
> Cyrillic to Latin with Diacritic as a new feature are welcome to use the
> spreadsheet in [6] as a starting point.
>
> Links:
>
> [1] This bug entry https://sourceware.org/bugzilla/show_bug.cgi?id=2872
> [2] GOST 7.79-2000 official source
> http://protect.gost.ru/document.aspx?control=7&id=130715 (is only
> available in low quality gif format)
> [3] http://transliteration.ru/gost-7-79-2000/ and
> http://www.yfermer.ru/specifications/285821.html
> [4] Wikipedia article on Cyrillic transliteration with Latin alphabet
> https://ru.wikipedia.org/wiki/%D0%A2%D1%80%D0%B0%D0%BD%D1%81%D0%BB%D0%B8%D1%82%D0%B5%D1%80%D0%B0%D1%86%D0%B8%D1%8F_%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D0%B0%D0%BB%D1%84%D0%B0%D0%B2%D0%B8%D1%82%D0%B0_%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B8%D1%86%D0%B5%D0%B9
>
> [5] http://man7.org/linux/man-pages/man5/locale.5.html
> [6] Spreadsheet for generating translit_cyrillic
> https://sourceware.org/bugzilla/attachment.cgi?bugid=2872&action=viewall&hide_obsolete=1
>
> [8] https://sourceware.org/glibc/wiki/Locales#Testing_Locales
> [9] translit-test-input.txt
> https://sourceware.org/bugzilla/attachment.cgi?id=11304
> [10] https://en.wikipedia.org/wiki/ISO_9#GOST_7.79_System_B
> [11]
> https://scriptsource.org/cms/scripts/page.php?item_id=entry_detail&uid=gslmka8xq3
>
>
> Best regards,
> Egor Kobylkin
>
>
>
>


--
Marko Myllynen
Reply | Threaded
Open this post in threaded view
|

Re: [PING^5][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]

Siddhesh Poyarekar-8
On 05/04/19 1:14 AM, Egor Kobylkin wrote:
> Ping?
>

I'm committing to looking at this on Monday if nobody gets to it ovevr
the weekend.

Siddhesh
Reply | Threaded
Open this post in threaded view
|

[PING^6][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]

Marko Myllynen
In reply to this post by Diego (Egor) Kobylkin
Ping?

On 19/03/2019 12.39, Egor Kobylkin wrote:

> Changelog v12:
> * Adjusted to the new comment style suddenly appearing in the target
> file locale/C-translit.h.in (the original file changed on the master
> branch from /* style to # style since v11)
> * Fixed a typo for <U04BB> CYRILLIC SMALL LETTER SHHA to be mapped to
> "sh`" instead of erroneous "SH`" in v11
>
> Changelog v11:
> * Re-targeted the patch against locale/C-translit.h.in as the proper
> file for the ASCII translit table.
> * Correspondingly the patch now only contains the additional
> Cyrillic-ASCII strings in the format of locale/C-translit.h.in table.
> The 'include "translit_cyrillic";""' directives are not necessary in the
> locale files and they are now all left intact.
> * Also the file translit_cyrillic is not longer needed and is omitted.
> * Edited below email, commit message.
>
> Changelog v10:
> * Removed ISO 9.1995 GOST 7.79-2000 System A (transliteration to Latin
> with diacritics) as conflicting with System B within glibc mechanics and
> not solving BZ #2872
> * Edited below email, commit message, comment in translit_cyrillic to
> reflect System A removal
> * Removed <U0423><U0301> and <U0443><U0301> (Cyrillic U with acute,
> using composition) as composing is not covered by current glibc
> conversion mechanics
>
> Changelog v9:
> * Fixed formatting (trailing spaces etc.)
> * Put commit summary in the patch file, now it is generated completely
> by git format-patch
>
> Changelog v8:
> * Re-added missing translit_cyrillic in patch v7 (due to missing "git
> add" in the script).
>
> Changelog v7:
> * Generated against git://sourceware.org/git/glibc.git master with git
> format-patch.
> * The 'include "translit_cyrillic";""' now immediately follows last
> 'include "translit_XXX";""' string (was inserted just before
> translit_end previously.)
> * Only the locales already having 'include .*translit.*;""' are patched
> (see the list for manual exclusions below, full list of included locales
> at the end of the email in the commit section.)
> * Excluded az_AZ completely to avoid circular reference from tr_TR via
> “copy "tr_TR"”.
>
> Changelog v6:
> * Locales removed from the patch: C and sd_PK.
> * Added locales: az_AZ and ky_KG.
> * Consistently transliterate single uppercase Cyrillic letters
>    to sequences of all uppercase Latin letters in all languages (whenever
>    a Cyrillic letter is transliterated to more than one Latin letter),
>    for example "Ї" is now transliterated as "YI" rather than "Yi".
>
> Dear locale maintainers,
>
> fix the glibc bug 2872 "Transliteration Cyrillic -> ASCII fails"
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=2872 [1]
>
> add the Cyrillic transliteration rows to locale/C-translit.h.in.
>
> The patch is attached.
>
>
> Current bug effect:
>
> The glibc wiki explicitly lists this use case as the test example and
> currently it fails on Cyrillic texts [1] [8] [9]:
>
> iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt |grep CYRILLIC
>
> CYRILLIC ????? ??? ???? ?????? ??????????? ?????, ?? ????? ?? ???.
>
> - it produces a string of question marks and spaces.
>
> This is what it should produce and it does so after the patch applied:
>
> CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe
> chayu.
>
>
> The root problem and the fix:
>
> The root problem is the missing transliteration table that I am
> supplying here.
>
>
> COMMIT MESSAGE:
> This translit_cyrillic table enables conversion (e.g. with iconv) from a
> UTF-8 encoded text based on Cyrillic alphabet to a ASCII//TRANSLIT text.
>
> Example: iconv -f UTF-8 -t ASCII//TRANSLIT will produce ASCII
> compatible transcription.
>
> While a UTF-encoded Cyrillic text requires Cyrillic fonts the result of
> a transliteration/transcription has only Latin/ASCII codes but still can
> be read by a native speaker. Among other things it is useful for
> processing the Cyrillic texts and filenames by programs or on systems
> that are not specifically prepared to work with Cyrillic, don't have
> corresponding fonts installed or can't handle UTF-8.
>
> The patch content (mapping) is based on ISO 9.1995 standard [10] and its
> derivative GOST 7.79-2000 System B official source (Federal Agency on
> Technical Regulating and Metrology Of Russian Federation [2]).
> Technically an independent but mostly identical source [3] was used and
> prepared in a spreadsheet [6].
>
> The transliteration of Cyrillic to ASCII according to GOST 7.79-2000
> System B represents what is actually called transcription (preserving
> phonemes), while System A is the transliteration (preserving graphemes).
> There is no meaningful way to preserve graphemes converting Cyrillic to
> ASCII and thus the System B is chosen [11]. To be super clear the System
> A has nothing to do with this bug regardless it being a transliteration.
>
> Those interested in implementing System A for transliteration of
> Cyrillic to Latin with Diacritic as a new feature are welcome to use the
> spreadsheet in [6] as a starting point.
>
> Links:
>
> [1] This bug entry https://sourceware.org/bugzilla/show_bug.cgi?id=2872
> [2] GOST 7.79-2000 official source
> http://protect.gost.ru/document.aspx?control=7&id=130715 (is only
> available in low quality gif format)
> [3] http://transliteration.ru/gost-7-79-2000/ and
> http://www.yfermer.ru/specifications/285821.html
> [4] Wikipedia article on Cyrillic transliteration with Latin alphabet
> https://ru.wikipedia.org/wiki/%D0%A2%D1%80%D0%B0%D0%BD%D1%81%D0%BB%D0%B8%D1%82%D0%B5%D1%80%D0%B0%D1%86%D0%B8%D1%8F_%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D0%B0%D0%BB%D1%84%D0%B0%D0%B2%D0%B8%D1%82%D0%B0_%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B8%D1%86%D0%B5%D0%B9
>
> [5] http://man7.org/linux/man-pages/man5/locale.5.html
> [6] Spreadsheet for generating translit_cyrillic
> https://sourceware.org/bugzilla/attachment.cgi?bugid=2872&action=viewall&hide_obsolete=1
>
> [8] https://sourceware.org/glibc/wiki/Locales#Testing_Locales
> [9] translit-test-input.txt
> https://sourceware.org/bugzilla/attachment.cgi?id=11304
> [10] https://en.wikipedia.org/wiki/ISO_9#GOST_7.79_System_B
> [11]
> https://scriptsource.org/cms/scripts/page.php?item_id=entry_detail&uid=gslmka8xq3
>
>
> Best regards,
> Egor Kobylkin
>
>
>
>


--
Marko Myllynen
Reply | Threaded
Open this post in threaded view
|

Re: [PING^6][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]

Carlos O'Donell-5
On 4/16/19 3:15 AM, Marko Myllynen wrote:
> Ping?

I have this patch applied locally and I'm working through some
comparisons for the transliteration.


--
Cheers,
Carlos.
Reply | Threaded
Open this post in threaded view
|

Re: [PING^6][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]

Diego (Egor) Kobylkin
Just FYI, this what I was testing: ./testrun.sh /usr/bin/iconv -f UTF-8
-t ASCII//TRANSLIT <<<
"ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍ
ҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’"

And this is the expected result ("" added by myself):
"YODJG`YEZ`IYIJL`N`TSHK`U`DHABVGDEZHZIJKLMNOPRSTUU?FXCZCHSHSHHA`Y``E`YUYAabvgdezhzijklmnoprstuu?fxczchshshh``y``e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FHfhYHyhE`e`
G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`T`t`UuH`h`TCZtczSH`sh`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`Y`y`'"

Bests,
Egor Kobylkin


On 16.04.19 15:17, Carlos O'Donell wrote:
> On 4/16/19 3:15 AM, Marko Myllynen wrote:
>> Ping?
>
> I have this patch applied locally and I'm working through some
> comparisons for the transliteration.
>
>
Reply | Threaded
Open this post in threaded view
|

Re: [PING^6][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]

Carlos O'Donell-5
On 4/16/19 1:06 PM, Egor Kobylkin wrote:
> Just FYI, this what I was testing: ./testrun.sh /usr/bin/iconv -f UTF-8 -t ASCII//TRANSLIT <<< "ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍ ҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’"
>
> And this is the expected result ("" added by myself):
> "YODJG`YEZ`IYIJL`N`TSHK`U`DHABVGDEZHZIJKLMNOPRSTUU?FXCZCHSHSHHA`Y``E`YUYAabvgdezhzijklmnoprstuu?fxczchshshh``y``e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FHfhYHyhE`e` G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`T`t`UuH`h`TCZtczSH`sh`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`Y`y`'"

Thanks.

I was using CyrTranslit (python translater) to review other work done in this area,
but it wasn't very fruitful.

$ python3
Python 3.7.3 (default, Mar 27 2019, 13:36:35)
[GCC 9.0.1 20190227 (Red Hat 9.0.1-0.8)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cyrtranslit
>>> cyrtranslit.supported()
dict_keys(['sr', 'me', 'mk', 'ru'])
>>> cyrtranslit.to_latin("ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’")
'ЁĐЃЄЅІЇJLjNjĆЌЎDžABVGDEŽZIЙKLMNOPRSTUÚFHCČŠЩЪЫЬЭЮЯabvgdežziйklmnoprstuúfhcčšщъыьэюяёđѓєѕіїjljnjćќўdžѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’'
>>>

"ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’"
'ЁĐЃЄЅІЇJLjNjĆЌЎDžABVGDEŽZIЙKLMNOPRSTUÚFHCČŠЩЪЫЬЭЮЯabvgdežziйklmnoprstuúfhcčšщъыьэюяёđѓєѕіїjljnjćќўdžѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’'

Which doesn't give a good transliteration.

But the table is better:
https://github.com/opendatakosovo/cyrillic-transliteration/blob/master/cyrtranslit/mapping.py#L138-L155

Ё -> YO.

Which is a good cross-check for me.

--
Cheers,
Carlos.
Reply | Threaded
Open this post in threaded view
|

Re: [PING^6][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]

Diego (Egor) Kobylkin


On 16.04.19 19:58, Carlos O'Donell wrote:

> On 4/16/19 1:06 PM, Egor Kobylkin wrote:
>> Just FYI, this what I was testing: ./testrun.sh /usr/bin/iconv -f
>> UTF-8 -t ASCII//TRANSLIT <<<
>> "ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍ
>> ҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’"
>>
>> And this is the expected result ("" added by myself):
>> "YODJG`YEZ`IYIJL`N`TSHK`U`DHABVGDEZHZIJKLMNOPRSTUU?FXCZCHSHSHHA`Y``E`YUYAabvgdezhzijklmnoprstuu?fxczchshshh``y``e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FHfhYHyhE`e`
>> G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`T`t`UuH`h`TCZtczSH`sh`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`Y`y`'"
>>
>
> Thanks.
>
> I was using CyrTranslit (python translater) to review other work done in
> this area,
> but it wasn't very fruitful.
>
> $ python3
> Python 3.7.3 (default, Mar 27 2019, 13:36:35)
> [GCC 9.0.1 20190227 (Red Hat 9.0.1-0.8)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import cyrtranslit
>>>> cyrtranslit.supported()
> dict_keys(['sr', 'me', 'mk', 'ru'])
>>>> cyrtranslit.to_latin("ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’")
>>>>
> 'ЁĐЃЄЅІЇJLjNjĆЌЎDžABVGDEŽZIЙKLMNOPRSTUÚFHCČŠЩЪЫЬЭЮЯabvgdežziйklmnoprstuúfhcčšщъыьэюяёđѓєѕіїjljnjćќўdžѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’'
>
>>>>
>
> "ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’"
>
> 'ЁĐЃЄЅІЇJLjNjĆЌЎDžABVGDEŽZIЙKLMNOPRSTUÚFHCČŠЩЪЫЬЭЮЯabvgdežziйklmnoprstuúfhcčšщъыьэюяёđѓєѕіїjljnjćќўdžѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’'
>
>
> Which doesn't give a good transliteration.

I guess the reason for that is that it is using the first key 'sr' from
your list that stands for Serbian. And Serbian doesn't have those
characters that are omitted ( "Щ" for example).

> But the table is better:
> https://github.com/opendatakosovo/cyrillic-transliteration/blob/master/cyrtranslit/mapping.py#L138-L155 
>
>
> Ё -> YO.
>
> Which is a good cross-check for me.

Yet the closest one from that codebase should be this
https://github.com/opendatakosovo/cyrillic-transliteration/blob/master/cyrtranslit/mapping.py#L88

It is exactly the reason we had 12 iterations on this patch - we wanted
to cover the most complete yet workable standard for the table. What we
reference in the bug memo is the actual accepted standard. It is
coalesced with the extended standard for further outdated cyrillic letters.

Bests,
Egor Kobylkin



Reply | Threaded
Open this post in threaded view
|

Re: [PING^6][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]

Carlos O'Donell-5
On 4/16/19 2:41 PM, Egor Kobylkin wrote:
> It is exactly the reason we had 12 iterations on this patch - we
> wanted to cover the most complete yet workable standard for the
> table. What we reference in the bug memo is the actual accepted
> standard. It is coalesced with the extended standard for further
> outdated cyrillic letters.

I agree, and this is what makes review complicated and time
consuming. I'm relying on you as the expert, and my goal is only
to spot check for any inconsistencies.

--
Cheers,
Carlos.
Reply | Threaded
Open this post in threaded view
|

Re: [PING^6][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]

Marko Myllynen
Hi Carlos,

On 16/04/2019 22.06, Carlos O'Donell wrote:

> On 4/16/19 2:41 PM, Egor Kobylkin wrote:
>> It is exactly the reason we had 12 iterations on this patch - we
>> wanted to cover the most complete yet workable standard for the
>> table. What we reference in the bug memo is the actual accepted
>> standard. It is coalesced with the extended standard for further
>> outdated cyrillic letters.
>
> I agree, and this is what makes review complicated and time
> consuming. I'm relying on you as the expert, and my goal is only
> to spot check for any inconsistencies.

I know you've been very busy with everything else but did you happen to
have any chance to check this further, shall we still wait for your
results or how would you suggests us to proceed?

Thanks,

--
Marko Myllynen