User talk:Aleksandar Šušnjar/Serbian Wikipedia's Challenges

Font (Typographical) Correctness

I had this problem when writing software to identify the language of a section of text. Not all the text I got was encoded correctly in Unicode. For example, I had a document (the UNUDHR) in Macedonian and I was wondering why it wasn't being identified correctly when all the rest worked. I did a hexdump of the file and, lo and behold, it was using the Latin code points for 'E', 'H', 'A', 'J' and 'O' - maybe to save on space, since Latin characters only take up one byte each in UTF-8, whereas Cyrillic ones take up two.

The word Декларација. ("Declaratsija") in Macedonian demonstrates this effectively:

Misencoded string (misencoded characters in bold):

0000000 94d0 d065 d0ba 61bb 80d1 d161 d086 6ab8
0000010 2e61 000a
0000013

String encoded correctly:

0000000 94d0 b5d0 bad0 bbd0 b0d0 80d1 b0d0 86d1
0000010 b8d0 98d1 b0d0 0a2e
0000018

(output from hexdump)

Just thought I'd contribute a bit :) - FrancisTyers 22:01, 10 March 2006 (UTC)

Декларација: This took me some time to decode without any tools on the computer I'm using (I decoded it by hand) ... there are some extra characters there, and the 16-bit words don't help (you have to swap all byte pairs). In any case, it appears that Cyrillic letters have been replaced with lookalikes. I highly doubt that this was done intentionally and to save space. What may have happened, for example, is that the author did not have a proper keyboard, so he either remapped it partially (only the letters that look different) or typed all non-lookalikes "with a mouse" (by selecting them in some other tool) and saved time by using Latin lookalikes for the others...

I'll check UNUDHR when I come home tonight...

--Aleksandar Šušnjar 00:04, 11 March 2006 (UTC)
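The byte-pair swapping mentioned above can be done mechanically. Below is a minimal Python sketch, assuming the dumps came from the default `hexdump` format (hexadecimal offsets, 16-bit words printed on a little-endian machine). The function name `bytes_from_hexdump` and the two constants are made up for illustration; the dump contents are the ones quoted earlier in this thread.

```python
MISENCODED = """\
0000000 94d0 d065 d0ba 61bb 80d1 d161 d086 6ab8
0000010 2e61 000a
0000013
"""

CORRECT = """\
0000000 94d0 b5d0 bad0 bbd0 b0d0 80d1 b0d0 86d1
0000010 b8d0 98d1 b0d0 0a2e
0000018
"""


def bytes_from_hexdump(dump: str) -> bytes:
    """Rebuild the raw bytes from 16-bit-word hexdump output."""
    data = bytearray()
    total = None
    for line in dump.strip().splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) == 1:
            # A lone offset is the final line and gives the true byte count.
            total = int(fields[0], 16)
            continue
        for word in fields[1:]:  # fields[0] is the offset column
            # Each word is printed high byte first, so swap the pair back.
            data += int(word, 16).to_bytes(2, "little")
    # Drop the zero padding hexdump adds when the length is odd.
    return bytes(data if total is None else data[:total])


for label, dump in (("misencoded", MISENCODED), ("correct", CORRECT)):
    raw = bytes_from_hexdump(dump)
    print(label, len(raw), "bytes:", ascii(raw.decode("utf-8")))
```

Printing with `ascii()` makes the difference visible: the Latin letters show up as themselves, while the real Cyrillic letters appear as `\u04xx` escapes.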

I did a search for Декларација (mixed alphabets) ... and found it at www.lexilogos.com/declaration/macedonien.htm ! The same site contains translations for other languages I understand (other than Macedonian) and they, too, contain errors. I also checked the Macedonian Wikipedia. You can find an article Листа на статии кои Македонската Википедија би требало да ги има (List of articles Macedonian Wikipedia should have) and in it a (broken) link to "Декларација за правата на човекот" (Human rights declaration) - spelled correctly, as expected.

All I can tell you is that you stumbled upon an anomaly. Many languages that do not use Latin as their primary script, or that use "extended" Latin (e.g. Serbian/Croatian Š, Č, Ć, Ž, Đ), "suffer" from novice users who invent alternative ways to depict the "extra" letters because they don't know how to set up and use national keyboards or are just too lazy to switch. To scare you a little, I'll give you an example of how the Serbian word "Дођош" can be spelled:

Dođoš (the only correct way)
Dodjos
Dodos
Dodyosh
Doddoss
Dodyosx
Dogyosh
Dod-os"
Dod~os"
Dod'os^
Dodow
Do]o[ (because Serbian keyboard has letter "Đ" in place of sign "]" and "Š" in place of "[")

etc.

Cyrillic was also a problem. For example, while Russian Cyrillic fonts were the only ones available, many people used similar-looking characters to replace the missing ones. For example, they used Ћ instead of Ђ.

That is the "dark side". The bright side is that such cases are incredibly rare, to the point of being simply ignorable. Can they be handled nevertheless? Yes, easily. Once the language and script are identified, a translator could "forgive" such mistakes and map Latin lookalike letters to the proper Cyrillic ones before actually performing the translation. Should that be done? I don't think it is worth doing...

--Aleksandar Šušnjar 04:18, 11 March 2006 (UTC)
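The "forgiving" step described above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the `LOOKALIKES` table and the `repair_lookalikes` name are hypothetical, the table covers just the letters named in this thread (not every possible lookalike), and a word is only touched if it already contains at least one Cyrillic letter, so genuinely Latin words are left alone.

```python
import re

# Hypothetical, deliberately small table: Latin letter -> Cyrillic lookalike.
LOOKALIKES = str.maketrans({
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "j": "\u0458",  # CYRILLIC SMALL LETTER JE
    "A": "\u0410",  # CYRILLIC CAPITAL LETTER A
    "E": "\u0415",  # CYRILLIC CAPITAL LETTER IE
    "O": "\u041e",  # CYRILLIC CAPITAL LETTER O
    "J": "\u0408",  # CYRILLIC CAPITAL LETTER JE
    "H": "\u041d",  # CYRILLIC CAPITAL LETTER EN
})

# Any character in the basic Cyrillic block.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def repair_lookalikes(text: str) -> str:
    """Replace Latin lookalikes inside words that already contain Cyrillic."""
    def fix(match: re.Match) -> str:
        word = match.group(0)
        if CYRILLIC.search(word):
            return word.translate(LOOKALIKES)
        return word  # purely Latin (or other) words stay untouched

    return re.sub(r"\w+", fix, text)


mixed = "\u0414e\u043a\u043ba\u0440a\u0446\u0438ja"  # the mixed-alphabet word
print(ascii(mixed), "->", ascii(repair_lookalikes(mixed)))
```

One limitation of this design: a word typed entirely in Latin lookalikes contains no Cyrillic letter, so it would not be repaired.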

Yeah, it isn't very frequent, but it is something to bear in mind when checking code: it might not be your code that has the issue :) I was digging around for a long time because I thought something was screwed up in the code! And yeah, I don't think it is really worth doing. You could enforce a limit of one writing system per word, but this would be overkill I think, and anything else is going to interfere with trademarks etc. Btw, on the subject of transliterations into ASCII, I've seen just as bad stuff done with Bulgarian! :) - FrancisTyers 13:09, 11 March 2006 (UTC)
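For reference, a "one writing system per word" check can be sketched in a few lines of Python. This is an assumption-laden illustration, not a proper script detector: the standard library has no Unicode script property, so the sketch uses the first word of each character's Unicode name (e.g. `LATIN`, `CYRILLIC`) as a rough script label. The function names `scripts_in` and `mixed_script_words` are invented here.

```python
import re
import unicodedata


def scripts_in(word: str) -> set:
    """Rough script labels for the letters of a word."""
    labels = set()
    for ch in word:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER DE" -> "CYRILLIC"
            labels.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return labels


def mixed_script_words(text: str) -> list:
    """Return the words whose letters come from more than one script."""
    return [w for w in re.findall(r"\w+", text) if len(scripts_in(w)) > 1]


sample = "\u0414e\u043a\u043ba\u0440a\u0446\u0438ja \u0437\u0430 pravata"
for word in mixed_script_words(sample):
    print(ascii(word), sorted(scripts_in(word)))
```

A check like this only flags suspicious words; as noted above, rejecting or rewriting them outright would also hit legitimate mixed-script names.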

Found one more thing. It seems that all incorrect occurrences of the word Декларација are in various translations of the Human Rights Declaration. I found that in the Serbian language translation, too. They must all be coming from a common source... --Aleksandar Šušnjar 18:41, 11 March 2006 (UTC)

Actually I emailed them about this, along with errors in the Croatian and Serbian (Latin) versions [1], but have received no reply (after about 9 months). UN inefficiency? No joke! :) - FrancisTyers 18:58, 11 March 2006 (UTC)

There is one more possibility. Weird and improbable but, nevertheless, a possibility. If you print a mixed Latin+Cyrillic text on a printer that does not have the required font(s), the definitions of the letters (actually their glyphs) will be uploaded to it. Now, fonts may be (are?) smart. In order to save both space and design time, they may declare that lookalike letters share the same glyph in the font file. This also allows the glyph to be uploaded once instead of twice (once for Cyrillic and again for Latin). If such a document was printed to a file and the text was then extracted from it, that may produce similar results. This is an unconfirmed wild guess, but it explains why the near lookalikes "к" and "k" did not get "confused": only the perfect lookalikes were...

--Aleksandar Šušnjar 21:12, 11 March 2006 (UTC)

Another redux, not to labour the point into the ground, but there is another possibility... The document had been scanned. I was looking for a POS tagger for Macedonian and came across this document which mentions a similar problem. "Incorrectly scanned characters were the first problem we sought to solve. The scanner recognized many Cyrillic characters as Latin ones, namely those whose glyph is shaped the same as a Latin one, and this had to be subsequently corrected." - FrancisTyers 14:07, 5 April 2006 (UTC)

Hmmm... That is quite probably the best hypothesis so far! The only thing that doesn't really fit any of our guesses well is the fact that this did not happen only for Macedonian but also for Serbian and, possibly, other languages. The sources of those documents are common, but the question is why the authors would scan their own documents. Maybe they sent them to someone outside to translate and got hard copies only?

--Aleksandar Šušnjar 14:53, 5 April 2006 (UTC)

Yeah, that would be my guess (they only had a hard copy). I definitely think that the people producing the electronic version are different from the people producing the translation. Another possible explanation is that they had the document in electronic format from a scan, e.g. PDF like for Kyrgyz[2], and to get it into a plain text format they had to print it out, scan it and then OCR it. It's a real shame that the UDHR is not unicoded properly for all languages; it's a good standardised multilingual aligned text... - FrancisTyers 15:17, 5 April 2006 (UTC)