Template talk:Indic encoding


Extra Codepoints for Tai Tham

@Vanisaac: Tai Tham can have alternative subscript forms (mostly for non-coda consonants) and superscript (mostly final) consonants. Does it make sense to add lanaaltcp and lanatopcp for these? I'm currently (ab)using lana2cp for U+1A5B TAI THAM CONSONANT SIGN HIGH RATHA OR LOW PA at Ṭha (Indic). Ba (Indic) would use |lanacp=1a3b |lanaaltcp=1a5b |lana2cp=1a3c |lanatopcp=1a5a. U+1A5A is essentially the glyph of U+1A3B in an unusual position.

The 'top' codepoints would be used for Indic NGA, BA, RA, VA. I might need 'top2' for U+1A58 TAI THAM SIGN MAI KANG LAI, which is like a repha, but for NGA.

On the other hand, ideally I would add lana3cp to allow |lanacp=1a3f |lana2cp=1a40 |lana3cp=1a6d. U+1A6D TAI THAM VOWEL SIGN OY, used to write Lao and Tai Lue, is the theoretical subscript form of U+1A40 and represents a diphthong, but corresponds historically to the use of U+0EBD LAO SEMIVOWEL SIGN NYO to write the sound /ɔːi/. (Some people mistakenly use it to write subscript LOW YA, which is encoded as <SAKOT, LOW YA>.) --RichardW57m (talk) 12:57, 26 October 2021 (UTC)

4 Descendants in Lao Script

I think we need 4 descendants in the Khmu alphabet (part of the Lao script):

  1. U+0E8D LAO LETTER NYO
  2. U+0EA2 LAO LETTER YO
  3. U+0EBD LAO SEMIVOWEL SIGN NYO
  4. U+0EDF LAO LETTER KHMU NYO

Although U+0EBD arose as the subscript form of U+0E8D, which is its traditional use in Lao, it has been adopted in Khmu as an initial consonant with sound /ɟ/. (I suspect because one allomorph looks like English 'J'.) --RichardW57m (talk) 12:57, 26 October 2021 (UTC)
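
For anyone checking codepoint values before putting them into template parameters, here is a minimal sketch that resolves the four Lao codepoints listed above to their Unicode character names. It uses only Python's standard `unicodedata` module; the helper name `describe` and the list name `LAO_YA_DESCENDANTS` are made up for illustration and are not part of any template.

```python
import unicodedata


def describe(cp_hex: str) -> str:
    """Return 'U+XXXX CHARACTER NAME' for a hex codepoint string such as '0ebd'."""
    code = int(cp_hex, 16)
    return f"U+{code:04X} {unicodedata.name(chr(code))}"


# Hypothetical list: the four Lao-script codepoints proposed in this thread.
LAO_YA_DESCENDANTS = ["0e8d", "0ea2", "0ebd", "0edf"]

for cp in LAO_YA_DESCENDANTS:
    print(describe(cp))
```

The same lookup works for the Tai Tham values quoted earlier on this page (for example `describe("1a5b")`), provided the Python build ships a Unicode database recent enough to include those characters.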

  • @RichardW57m: How common are these exceptions? It looks like the extra Tai Tham codepoints might happen often enough to warrant accommodating in the full template. I would suggest that for the Lao Nyo/Yo set that you use {{charmap}} like in the documentation under "Additional codepoints" (I just added that section). This template is really just a wrapper for {{charmap}} that consistently organizes these codepoints for easy reference, and utilizes the existing parameters from the predecessor {{Indic glyph}} template. VanIsaac, MPLL contWpWS 21:08, 26 October 2021 (UTC)
    @Vanisaac: Indic ya is exceptional in how many variants it can spin off. (Almost as bad as Semitic waw giving English 'f', 'u', 'v', 'y', 'w'.) On the other hand, the alternative subscripts for Tai Tham are not so exceptional: Indic ttha, pa, ba, ma, ra, la, sa. (I'm strongly tempted to use the absent nga slot for 'top2'.) I've just unilaterally added '2' for Myanmar, and I realise I need, in the codepoints at least, to cater for the Myanmar medials: there are 8 of them, and there is some malarkey with one of them being for a different consonant in S'gaw Karen. --RichardW57 (talk) 21:44, 26 October 2021 (UTC)
Oh, that's always fun when you have one alphabet that screws up the whole script. You seem to have already worked out how to add a second Burmese codepoint. The rule of thumb I used was to limit it to 9 characters in a row. For the difference between an "alt" vs. "2" codepoint, the distinction really lies in whether you are talking about a character with a different identity, or whether it's fundamentally the same character with a different presentation for certain uses. So "xxxxcp" has a subjoined form of "xxxxaltcp", while a second character is "xxxx2cp". If you needed additional alt forms, you would have "xxxxcp", "xxxxaltcp", "xxxxalt2cp" ... "xxxxaltNcp", while the second character would be "xxxx2cp", "xxxx2altcp", "xxxx2alt2cp" ... "xxxxNcp", "xxxxNaltcp", "xxxxNalt2cp" ... "xxxxNaltNcp". I think keeping the parameter names all in the +number / "alt" (+ number) format is the most consistent. VanIsaac, MPLL contWpWS 21:56, 26 October 2021 (UTC)
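
The naming scheme described above can be written out as a small sketch. This is only an illustration of the pattern as stated in this thread, not code from the template itself; the function name `param_name` and its arguments are hypothetical, and "lana" is used as the script prefix only because it appears in the examples earlier on this page.

```python
def param_name(prefix: str, char_index: int = 1, alt_index: int = 0) -> str:
    """Build a codepoint parameter name in the number / "alt" (+ number) format.

    char_index: 1 for the first descendant character, 2 for the second, etc.
    alt_index:  0 for the base form, 1 for "alt", 2 for "alt2", etc.
    """
    if char_index < 1 or alt_index < 0:
        raise ValueError("char_index must be >= 1 and alt_index must be >= 0")
    # The first character carries no number; later characters carry their index.
    number = "" if char_index == 1 else str(char_index)
    # The first alternate form is plain "alt"; later ones are numbered.
    if alt_index == 0:
        alt = ""
    elif alt_index == 1:
        alt = "alt"
    else:
        alt = f"alt{alt_index}"
    return f"{prefix}{number}{alt}cp"


# Enumerate the names for two characters with up to two alternate forms each.
for char_index in (1, 2):
    for alt_index in (0, 1, 2):
        print(param_name("lana", char_index, alt_index))
```

Under this reading, the identity of the character (the number) always comes before its presentation variant (the "alt" part), which matches the order "xxxx2altcp" given above.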
@Vanisaac: How do you regard the 7 New Tai Lue final consonants? In form they're regular letters plus a diacritic, but they're encoded separately. I am inclined to treat them as 'presentation' forms. There are also 6 conjuncts with LOW VA, all encoded as separate characters. (I might count SIGN LAEV as well - I'm not sure about that. Historically, it's a ligature of LOW LA, a vowel and a subscript consonant.) Ka is the only letter with both, so I could treat KVA as a distinct letter (talu2cp) simply to keep the apparent maximum count down and treat the other 12 as 'presentation' forms. New Tai Lue Ga may need special treatment - it has LOW KA, LOW XA, LOW KVA and LOW XVA. --RichardW57m (talk) 13:02, 27 October 2021 (UTC)
  • I think a final consonant would be an alternate form. Ostensibly, you could have two descendant letters, each with an independent final version that is definitely traceable to that letter. That would indicate it's not the same as just another descendant letter, but rather a particular form of each of the descendant letters. I would recommend using the extra charmap for encoded conjuncts with a note - they seem to be qualitatively different than an additional descendant or alternate form. VanIsaac, MPLL contWpWS 21:38, 27 October 2021 (UTC)

Loose Charmap Calls

@Vanisaac: I don't like the idea of unannounced extra codepoints. If one grows used to this template's output, one will get used to looking for a script's codepoints at a particular location, and then look no further. Do you think we can rely on users looking at the list of contents, and seeing a header like 'Additional Tai Tham Codepoints'? (Siamese sextuplets or whatever the Burmese script is are another matter.) --RichardW57m (talk) 13:47, 27 October 2021 (UTC)

@Vanisaac: Now in use, in Pa (Indic), which is on my watch list. --RichardW57 (talk) 21:26, 27 October 2021 (UTC)

Codepoint Order

Is there any standard to the ordering of the codepoints in the invocations? I had begun to think that after the ISCII scripts, scripts were kept together and ordered by the lowest codepoint in each script, but the more I revise, the less this seems to be the case. --RichardW57m (talk) 13:47, 27 October 2021 (UTC)

The ordering places parent scripts before child scripts, and tries to group the most closely related scripts next to each other, but beyond those basics there is nothing inherently meaningful about the exact ordering - mostly it was going down and left on the tree in {{Indic glyph}}, which has a general older-to-younger trend, and a slight northern India-outwards bias. For the ISCII scripts, it's literally in order of the shift codes for switching scripts in ISCII. I would hesitate to organize by codepoint specifically because it assumes a priori understanding of the content that is being presented instead of organizing that content to be found without that foreknowledge. If this were instead articles about the computer encoding of Indic scripts, then I would think grouping by codepoint would probably be ideal. In the end, they had to go in some order, and this is how it ended up. If we can increase the intuitive logic by moving some things around, I'd be all for it. VanIsaac, MPLL contWpWS 19:03, 27 October 2021 (UTC)
@Vanisaac: I'm talking about the order in the invocations of this template, which is only visible to editors, not the structure within charmap. Swapping entries doesn't change the appearance of the formatted pages. --RichardW57 (talk) 19:44, 27 October 2021 (UTC)
That sort of thing is completely up to the editor in question when you are adding content. I wouldn't just change it around as a WP:Cosmetic edit, but if rearranging them enables you to derive information to get more content added, please be my guest. They ended up in that order because I programmed the substitution template using the documentation table. So everything just ended up in that order. VanIsaac, MPLL contWpWS 20:17, 27 October 2021 (UTC)