
Talk:Unicode/Archive 7

From Wikipedia, the free encyclopedia

Number of issues.

I just edited the Issues section to include the number of identified "issues" with characters (code points): by my count, the cited April 2017 document lists 94 of them. I am copying them here from that document (with minor editing for brevity), as it may be helpful.

  • U+0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE not usually considered a single letter.
  • U+01A2 LATIN CAPITAL LETTER OI LATIN CAPITAL LETTER GHA, not OI
  • U+01A3 LATIN SMALL LETTER OI LATIN SMALL LETTER GHA, not oi
  • U+01BE LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE ligation of "ts"; not an inverted glottal stop
  • U+0238 LATIN SMALL LETTER DB DIGRAPH ligature, not a digraph
  • U+0239 LATIN SMALL LETTER QP DIGRAPH ligature, not a digraph
  • U+025B LATIN SMALL LETTER OPEN E Latin small letter epsilon [ idk if it is "open" or "closed" see U+025E]
  • U+025E LATIN SMALL LETTER CLOSED REVERSED OPEN E Latin small letter closed reversed epsilon (reversed form of U+025B).
  • U+0285 LATIN SMALL LETTER SQUAT REVERSED ESH reversed fishhook r with retroflex hook.
  • U+02C7 CARON hacek
  • U+030C COMBINING CARON combining hacek
  • U+034F COMBINING GRAPHEME JOINER incorrect description of function; it does not join graphemes
  • U+039B GREEK CAPITAL LETTER LAMDA preferably, but not necessarily, GREEK CAPITAL LETTER LAMBDA
  • U+03BB GREEK SMALL LETTER LAMDA preferably, but not necessarily, GREEK SMALL LETTER LAMBDA
  • U+04A5 CYRILLIC SMALL LIGATURE EN GHE not a decomposable ligature
  • U+04B5 CYRILLIC SMALL LIGATURE TE TSE not a decomposable ligature
  • U+04D5 CYRILLIC SMALL LIGATURE A IE not a decomposable ligature
  • U+0598 HEBREW ACCENT ZARQA Misleading, probably should have been called Hebrew accent tsinnorit
  • U+05AE HEBREW ACCENT ZINOR Should have been called Hebrew accent zarqa (= tsinor)
  • U+0670 ARABIC LETTER SUPERSCRIPT ALEF Not an Arabic letter, but a vowel sign.
  • U+06C0 ARABIC LETTER HEH WITH YEH ABOVE not a letter but a ligature
  • U+06C2 ARABIC LETTER HEH GOAL WITH HAMZA ABOVE not a letter but a ligature
  • U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE not a letter but a ligature
  • U+0709 SYRIAC SUBLINEAR COLON SKEWED RIGHT SYRIAC SUBLINEAR COLON SKEWED LEFT
  • U+0964 DEVANAGARI DANDA Despite the fact that these characters have "DEVANAGARI" in their names, these punctuation marks are intended for common use for the scripts of India.
  • U+0965 DEVANAGARI DOUBLE DANDA Despite the fact that these characters have "DEVANAGARI" in their names, these punctuation marks are intended for common use for the scripts of India.
  • U+0A01 GURMUKHI SIGN ADAK BINDI GURMUKHI SIGN ADDAK BINDI
  • U+0B83 TAMIL SIGN VISARGA This character is actually the aaytham, and is not used as a visarga in Tamil.
  • U+0CDE KANNADA LETTER FA There is no Kannada letter 'fa', this character represents the syllable 'llla'. A formal alias correcting this error has been defined.
  • U+0E9D LAO LETTER FO TAM The name for this character should have been fo sung, but that name is already used for U+0E9F. A formal alias LAO LETTER FO FON correcting this error has been defined.
  • U+0E9F LAO LETTER FO SUNG The name for this character should have been fo tam, but that name is already used for U+0E9D. A formal alias LAO LETTER FO FAY correcting this error has been defined.
  • U+0EA3 LAO LETTER LO LING The name for this character should have been lo loot, but that name is already used for U+0EA5. A formal alias LAO LETTER RO correcting this error has been defined.
  • U+0EA5 LAO LETTER LO LOOT The name for this character should have been lo ling, but that name is already used for U+0EA3. A formal alias LAO LETTER LO correcting this error has been defined.
  • U+0F0A TIBETAN MARK BKA- SHOG YIG MGO This character is used to indicate that a document is addressed to a superior (the "petition honorific"), but the Tibetan name actually indicates a superior addressing an inferior ("starting flourish for giving a command").
  • U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG The tsheg mark is not restricted to intersyllabic usage, and would have been better named Tibetan mark tsheg.
  • U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR This character is not a delimiter, but is a non-breaking version of the tsheg mark (U+0F0B) that is used exclusively between the letter NGA (U+0F44) and the shad mark (U+0F0D).
  • U+0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN The syllable "BSKA-" does not occur naturally in Tibetan, and is a mistake for "BKA-" (cf. U+0F0A). A formal alias correcting this error has been defined.
  • U+11EC HANGUL JONGSEONG IEUNG-KIYEOK U+11EC HANGUL JONGSEONG YESIEUNG-KIYEOK
  • U+11ED HANGUL JONGSEONG IEUNG-SSANGKIYEOK U+11ED HANGUL JONGSEONG YESIEUNG-SSANGKIYEOK
  • U+11EE HANGUL JONGSEONG SSANGIEUNG U+11EE HANGUL JONGSEONG SSANGYESIEUNG
  • U+11EF HANGUL JONGSEONG IEUNG-KHIEUKH U+11EF HANGUL JONGSEONG YESIEUNG-KHIEUKH
  • U+156F CANADIAN SYLLABICS TTH There is no 'tth' syllable. A better name would have been Canadian Syllabics asterisk.
  • U+178E KHMER LETTER NNO As this character belongs to the first register, its correct transliteration is nna, not NNO.
  • U+179E KHMER LETTER SSO As this character belongs to the first register, its correct transliteration is ssa, not SSO.
  • U+200B ZERO WIDTH SPACE This isn't a "space". It is an invisible character that can be used to provide line break opportunities.
  • U+2113 SCRIPT SMALL L Despite its character name, this symbol is derived from a special italicized version of the small letter "L".
  • U+2118 SCRIPT CAPITAL P Should have been called calligraphic small p or Weierstrass elliptic function symbol, which is what it is used for. It is not a capital "P" at all. A formal alias correcting this to WEIERSTRASS ELLIPTIC FUNCTION has been defined.
  • U+234A APL FUNCTIONAL SYMBOL DOWN TACK UNDERBAR named according to the Bosworth convention. Inconsistent with current APL specifications & the London convention; the names of these five symbols no longer match APL usage for up and down.
  • U+234E APL FUNCTIONAL SYMBOL DOWN TACK JOT named according to the Bosworth convention. Inconsistent with current APL specifications & the London convention; the names of these five symbols no longer match APL usage for up and down.
  • U+2351 APL FUNCTIONAL SYMBOL UP TACK OVERBAR named according to the Bosworth convention. Inconsistent with current APL specifications & the London convention; the names of these five symbols no longer match APL usage for up and down.
  • U+2355 APL FUNCTIONAL SYMBOL UP TACK JOT named according to the Bosworth convention. Inconsistent with current APL specifications & the London convention; the names of these five symbols no longer match APL usage for up and down.
  • U+2361 APL FUNCTIONAL SYMBOL UP TACK DIAERESIS named according to the Bosworth convention. Inconsistent with current APL specifications & the London convention; the names of these five symbols no longer match APL usage for up and down.
  • U+2448 OCR DASH MICR ON US SYMBOL
  • U+2449 OCR CUSTOMER ACCOUNT NUMBER MICR DASH SYMBOL
  • U+2629 CROSS OF JERUSALEM cross potent. The actual cross of Jerusalem is a cross potent with a small crosslet added at each corner.
  • U+262B FARSI SYMBOL This symbol is so named because as symbol of Iran it cannot be encoded in ISO standards.
  • U+2B7A LEFTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE HORIZONTAL STROKE LEFTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE VERTICAL STROKE
  • U+2B7C RIGHTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE HORIZONTAL STROKE RIGHTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE VERTICAL STROKE
  • U+3021 HANGZHOU NUMERAL ONE HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
  • U+3022 HANGZHOU NUMERAL TWO HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
  • U+3023 HANGZHOU NUMERAL THREE HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
  • U+3024 HANGZHOU NUMERAL FOUR HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
  • U+3025 HANGZHOU NUMERAL FIVE HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
  • U+3026 HANGZHOU NUMERAL SIX HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
  • U+3027 HANGZHOU NUMERAL SEVEN HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
  • U+3028 HANGZHOU NUMERAL EIGHT HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
  • U+3029 HANGZHOU NUMERAL NINE HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
  • U+327C CIRCLED KOREAN CHARACTER CHAMKO An instance of inconsistent transliterations, resulting from irreconciled North/South Korean positions.
  • U+327D CIRCLED KOREAN CHARACTER JUEUI An instance of inconsistent transliterations, resulting from irreconciled North/South Korean positions.
  • U+A015 YI SYLLABLE WU a syllable iteration mark, not a syllable "wu"
  • U+FA0E CJK COMPATIBILITY IDEOGRAPH-FA0E unified CJK ideograph, not compatibility ideograph
  • U+FA0F CJK COMPATIBILITY IDEOGRAPH-FA0F unified CJK ideograph, not compatibility ideograph
  • U+FA11 CJK COMPATIBILITY IDEOGRAPH-FA11 unified CJK ideograph, not compatibility ideograph
  • U+FA13 CJK COMPATIBILITY IDEOGRAPH-FA13 unified CJK ideograph, not compatibility ideograph
  • U+FA14 CJK COMPATIBILITY IDEOGRAPH-FA14 unified CJK ideograph, not compatibility ideograph
  • U+FA1F CJK COMPATIBILITY IDEOGRAPH-FA1F unified CJK ideograph, not compatibility ideograph
  • U+FA21 CJK COMPATIBILITY IDEOGRAPH-FA21 unified CJK ideograph, not compatibility ideograph
  • U+FA23 CJK COMPATIBILITY IDEOGRAPH-FA23 unified CJK ideograph, not compatibility ideograph
  • U+FA24 CJK COMPATIBILITY IDEOGRAPH-FA24 unified CJK ideograph, not compatibility ideograph
  • U+FA27 CJK COMPATIBILITY IDEOGRAPH-FA27 unified CJK ideograph, not compatibility ideograph
  • U+FA28 CJK COMPATIBILITY IDEOGRAPH-FA28 unified CJK ideograph, not compatibility ideograph
  • U+FA29 CJK COMPATIBILITY IDEOGRAPH-FA29 unified CJK ideograph, not compatibility ideograph
  • U+FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET A spelling error: "brakcet" should be "bracket". A formal alias correcting this error has been defined.
  • U+FEFF ZERO WIDTH NO-BREAK SPACE Byte Order Mark (Naming it ZWNBSP was a mistake from the start.)
  • U+122D4 CUNEIFORM SIGN SHIR TENU CUNEIFORM SIGN NU11 TENU
  • U+122D5 CUNEIFORM SIGN SHIR OVER SHIR BUR OVER BUR CUNEIFORM SIGN NU11 OVER NU11 BUR OVER BUR
  • U+1B001 HIRAGANA LETTER ARCHAIC YE The preferred name is HENTAIGANA LETTER E-1
  • U+1D0C5 BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS U+1D0C5 BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS
  • U+1D300 MONOGRAM FOR EARTH U+1D300 MONOGRAM FOR HUMAN
  • U+1D301 DIGRAM FOR HEAVENLY EARTH U+1D301 DIGRAM FOR HEAVENLY HUMAN
  • U+1D302 DIGRAM FOR HUMAN EARTH U+1D302 DIGRAM FOR EARTHLY HUMAN
  • U+1D303 DIGRAM FOR EARTHLY HEAVEN U+1D303 DIGRAM FOR HUMANLY HEAVEN
  • U+1D304 DIGRAM FOR EARTHLY HUMAN U+1D304 DIGRAM FOR HUMANLY EARTH
  • U+1D305 DIGRAM FOR EARTH U+1D305 DIGRAM FOR HUMANLY HUMAN

--sorry, my copy & paste does not retain the two columns they were in. 75.90.36.201 (talk) 20:06, 9 April 2018 (UTC)
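Several entries in the list above mention that "a formal alias correcting this error has been defined". As a rough illustration of how original names and aliases coexist, here is a minimal sketch using Python's standard `unicodedata` module. It assumes a Python 3.3+ build, where `lookup()` accepts formal aliases; the `CORRECTIONS` table and the `describe` helper are made up for this example and cover only two of the listed code points.

```python
import unicodedata

# Two code points from the list above whose names were corrected by a formal alias.
# The dictionary itself is illustrative, not an official data file.
CORRECTIONS = {
    0x2118: "WEIERSTRASS ELLIPTIC FUNCTION",
    0xFE18: "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET",
}


def describe(cp: int, alias: str) -> str:
    """Report the immutable original name and whether the alias resolves to it."""
    ch = chr(cp)
    original = unicodedata.name(ch)                # the original name, errors included
    resolves = unicodedata.lookup(alias) == ch     # lookup() also understands aliases
    return f"U+{cp:04X} name={original!r} alias={alias!r} alias_resolves={resolves}"


for cp, alias in CORRECTIONS.items():
    print(describe(cp, alias))
```

The original names stay in place because Unicode character names are never changed once published; the alias is an additional, corrected label.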

I formatted your previous edit, then counted the number of asterisks ("*"s) in the source text. 94 seems to be the correct number. However this is on the verge of Original research. And will you track changes made to Unicode Technical Note #27? Love —LiliCharlie (talk) 20:33, 9 April 2018 (UTC)
I understand that there are (at least) 12 code-points which represent non-existent "characters" (also known as "ghost characters"). 妛挧暃椦槞蟐袮閠駲墸壥彁 are according to https://www.dampfkraft.com/by-id/a824aa10/#A-Spectre-is-Haunting-Unicode meaningless and NOT part of any language. In addition (as of 3/12/2018) parts of the issues section have been removed which in my view amounts to vandalism. The most egregious removal is all mention of the (politically motivated) concessions Unicode Consortium made to various nations because they claimed (rather than the experts of the relevant languages) to be the authoritative source of the language. The current article white-washes this (to some extent) by implying that some of these disagreements are over "ancient" or "obsolete" language elements when in fact some of them are in current (but "unofficial") use. Also, I vote that the 94 (or 106 if the above dozen are included) issues should be listed in the article (as a collapsed table, sortable by code-point name or U-number). 72.16.99.93 (talk) 22:18, 3 December 2018 (UTC)
None of this is relevant. A bunch of these are controversial; after much discussion, the Wikipedia article is at caron, not hacek. The complaint you have about the APL characters says "named according to the Bosworth convention", which is a choice, not a mistake. Even the clear errors are irrelevant; we barely mention that Byzantine music and hentaigana are supported, thus stressing about the naming of one of the characters, a name that will have little effect on users, is beneath mention. Nobody will use 10% of Unicode's characters; it's not a real problem that there are 12 characters that have no real use.
Editing the issues section is not vandalism; it's people disagreeing with you. I'm not even sure what you're talking about; the last three months have had no changes to the issue section.
(Please don't use xx/xx/20xx date formats; they're inherently ambiguous, as a significant number of readers will interpret them as month/day and a significant number will interpret them as day/month.)--Prosfilaes (talk) 21:46, 24 December 2018 (UTC)

Other persisting "anomalies"

The "combining class" priorities assigned to Hebrew diacritics in the early 1990s are incorrect and semi-worthless, which means that older software displays the diacritics incorrectly, while more recent software has to work around it, but apparently this is also set in stone, and nothing can be done to fix it... AnonMoos (talk) 03:06, 4 February 2019 (UTC)
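For readers unfamiliar with combining classes, the following is a minimal sketch of how the fixed values assigned to Hebrew points drive canonical reordering. It uses Python's standard `unicodedata` module; the choice of bet, dagesh and sheva is only an example, and the sketch shows the mechanism rather than demonstrating that any particular value is wrong.

```python
import unicodedata

BET, DAGESH, SHEVA = "\u05D1", "\u05BC", "\u05B0"

# Each combining mark carries a numeric canonical combining class.
for mark in (SHEVA, DAGESH):
    print(unicodedata.name(mark), unicodedata.combining(mark))

# Normalization sorts adjacent marks by that class, so the order in which
# the marks were typed is not preserved.
typed = BET + DAGESH + SHEVA
normalized = unicodedata.normalize("NFC", typed)
print([f"U+{ord(c):04X}" for c in typed])
print([f"U+{ord(c):04X}" for c in normalized])
```

Because normalization stability guarantees forbid changing these values, software that needs a different mark order has to handle it at the rendering stage.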

Names or glyphs? Response to Prosfilaes

Prosfilaes has reverted my replacement of code point names with glyphs, holding that "in explaining the architectures, names are more important than glyphs". I disagree. The official names play no role in the structure of Unicode. Some code points like U+0009, the tab character, do not even have official names and, of those that do, some are incorrect (see above) and others, like LATIN SMALL LETTER Q (which displays a capital letter that seemingly claims to be small) are confusing. The Unicode Standard nowhere says that anything depends on the name of a code point.

A code point with a graphic "basic type", which most of the assigned code points have, determines the general shape of its associated glyphs. The additional designation of a font makes the shape precise, and adding the point size completes the glyph specification. Code points are of interest mainly because of this association with glyphs.

In lower case, the Greek letter sigma has two code points, U+03C2 and U+03C3. The first applies when the letter occurs at the end of a word, the second when it occurs elsewhere. Why two, when it's the same letter, pronounced the same way? Only because the shape is not even roughly the same, ς for U+03C2 and σ for U+03C3. Glyphs that differ so radically can never represent the same code point. Unlike anything having to do with official names, this is a basic feature of Unicode architecture.

In contrast, the exclamation mark ' ! ' is used for the factorial function in mathematics as well as a punctuation mark ending a sentence emphatically. These are two very different uses with nothing in common but the glyph in each applicable font, yet they have the same code point, U+0021. They are not distinguished in Unicode because the distinction has no consequence for glyphs.

One cannot always use a glyph to designate a code point uniquely. The glyph ' P ' can represent U+0050 (the first letter in Prosfilaes' username and mine), U+03A1 (the Greek letter rho), or U+0420 (the Cyrillic letter er). Unique designation is usually possible, though, and—when it is—presenting glyphs as I did in the reverted text is more helpful to the average reader than is presenting the name.

Prosfilaes also complains that 𑀈, my example of a non-BMP character, looks too much like a plus sign, which is in the BMP. That hadn't occurred to me, but another non-BMP code point could certainly be used.

Peter Brown (talk) 16:55, 20 February 2019 (UTC)

Unicode encodes characters, not glyphs. Identical glyphs may be used to represent different characters (as, typically, U+0041 A LATIN CAPITAL LETTER A, U+0391 Α GREEK CAPITAL LETTER ALPHA, and U+0410 А CYRILLIC CAPITAL LETTER A), and completely different glyphs may represent the same character (U+0041 A LATIN CAPITAL LETTER A may look like 𝖠, 𝒜, 𝔄, etc.).
Specifically, typical glyphs representing the character U+00F7 ÷ DIVISION SIGN can easily be confused with U+2797 HEAVY DIVISION SIGN, U+1365 ETHIOPIC COLON or U+223B HOMOTHETIC, while the "two-dot shape" of U+11008 𑀈 BRAHMI LETTER II looks like U+A58C VAI SYLLABLE JOO, and its "four-dot shape" resembles U+2E2C SQUARED FOUR DOT PUNCTUATION, U+2237 PROPORTION, U+26DA DRIVE SLOW SIGN, U+2D46 TIFINAGH LETTER TUAREG YAKH, U+1362 ETHIOPIC FULL STOP, and several of the Braille patterns.
There is no way to confidently identify an isolated character when you only see a glyph that visualises it. It is necessary to give its semantics which in most cases is reflected by its character name. Love —LiliCharlie (talk) 18:54, 20 February 2019 (UTC)
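The point about look-alike glyphs can be checked mechanically. This is a minimal sketch with Python's standard `unicodedata` module; the `identify` helper and the `LOOKALIKES` list are illustrative names, not part of any library.

```python
import unicodedata

# Three characters whose typical glyphs are indistinguishable: A, Α, А.
LOOKALIKES = ["\u0041", "\u0391", "\u0410"]


def identify(ch: str) -> str:
    """Return the U+ notation and the character name for one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in LOOKALIKES:
    print(ch, identify(ch))

# The styled letters 𝖠, 𝒜 and 𝔄 are separate code points in their own right;
# compatibility normalization (NFKC) folds each of them to U+0041.
for styled in "\U0001D5A0\U0001D49C\U0001D504":
    folded = unicodedata.normalize("NFKC", styled)
    print(identify(styled), "->", identify(folded))
```

Only the name or the code point, not the rendered shape, tells the three look-alikes apart.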
I think this is far overbroad for the edit in question. The dispute is between "For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used, as required (e.g., U+E0001 for the character LANGUAGE TAG and U+10FFFD for the character PRIVATE USE CHARACTER-10FFFD)." and " For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g. U+00F7 for the character ÷); for code points outside the BMP, five or six digits are used, as required (e.g. U+11008 for the character 𑀈)." As I said, this is about architecture. Yes, there are confusing names for certain Unicode characters, but this is about how many digits are used to represent that Unicode character. It doesn't matter at this layer if a name is confusing or how it might map to glyphs or user-perceived characters; just that there exists a code point labeled LATIN CAPITAL LETTER X and that it is also referenced as U+0058.--Prosfilaes (talk) 21:20, 20 February 2019 (UTC)
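The digit-count convention under dispute is easy to state in code. This is a minimal sketch; `u_notation` is a hypothetical helper name, and the three sample values are the code points quoted in the first wording above.

```python
def u_notation(code_point: int) -> str:
    """Format a code point as U+ followed by at least four uppercase hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Zero-pad to four digits; values above U+FFFF naturally take five or six.
    return f"U+{code_point:04X}"


for cp in (0x0058, 0xE0001, 0x10FFFD):
    print(u_notation(cp))
```

Note that the function never consults a name or a glyph, which is the layer of the architecture the quoted sentence describes.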
Overbroad, perhaps, but I do want to respond to LiliCharlie, who claimed that "completely different glyphs may represent the same character", challenging my claim that "Glyphs that differ...radically can never represent the same code point." As support, LiliCharlie writes, "U+0041 A LATIN CAPITAL LETTER A may look like 𝖠, 𝒜, 𝔄, etc." This is supportive, however, only if LiliCharlie can name fonts in which U+0041 has 𝒜 and 𝔄, respectively, as glyphs. As far as I can determine, 𝒜 has code point U+1D49C and 𝔄 has code point U+1D504. A font in which U+0041 has 𝒜 as a glyph would hardly be a sufficient challenge anyhow, as this is quite similar to A. 𝔄 admittedly differs radically, so a font in which it represents U+0041 would definitely count against my claim.
I challenge LiliCharlie to explain why, in lower case, medial sigma (σ) and final sigma (ς) are assigned different code points while medial and final lower-case theta (θ) both have the one code point U+03B8. The obvious answer, though there may be another, is that the glyphs for lower-case sigma, in most or all applicable fonts, are very different.
Returning to the original dispute with Prosfilaes, the choice is between
For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used, as required (e.g., U+E0001 for the character LANGUAGE TAG and U+10FFFD for the character PRIVATE USE CHARACTER-10FFFD).
and
For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g. U+00F7 for the character ÷); for code points outside the BMP, five or six digits are used, as required (e.g. U+11008 for the character 𑀈).
At least for the sake of argument, I concede Prosfilaes' point that 𑀈 looks too much like a plus sign and propose, in the final parenthesis,
e.g. U+2395C for the character 𣥜
The real question is which phrasing is more accommodating for the typical reader. I disagree that "It doesn't matter at this layer if a name is confusing"; confusing names will confuse, which contravenes Wikipedia's objectives. The use of capitals is off-putting, especially as the reader has not been advised earlier (or, indeed, anywhere in the article) that letters in official Unicode names have to be capitalized. There is no explanation of what a language tag is; the phrase is simply sprung on the unsuspecting reader. Likewise with "private use", a phrase appearing in a quote from Joe Becker but never explained.
For readers already acquainted with Unicode conventions, these considerations are not relevant. Such folks, however, are not the intended audience for Wikipedia articles.
Peter Brown (talk) 23:53, 21 February 2019 (UTC)
Luthersche Fraktur was the first one I found, and a search for Fraktur fonts shows that many of them use the glyph form 𝔄.
We're talking about code points, not characters. You're adding confusion by saying "For code points" and then saying "the character ÷". I think you're underestimating the type of reader who is reading this article, or underestimating the difficulty of the rest of the article. The fact that names are capitalized is something that you learn about Unicode by exposure, and again, for the audience, is something they'll just absorb. Anyone with any familiarity with character encoding in computers will expect that there's control characters in Unicode, like LANGUAGE TAG.
I object to the use of 𣥜, since that implies that Chinese is outside the BMP. Hieroglyphs or other clearly ancient script, that's completely outside the BMP, should be used, or possibly an emoji. You're giving up the ability to show a six-digit name if you insist on using characters.--Prosfilaes (talk) 01:13, 22 February 2019 (UTC)

@Peter Brown: 1. There are two major reasons why U+03C2 ς GREEK SMALL LETTER FINAL SIGMA and U+03C3 σ GREEK SMALL LETTER SIGMA were encoded separately. The first, and already sufficient, one was to ensure round-trip compatibility with encodings that had existed before Unicode, and in which the two characters were also encoded separately. And reason number two is that there are exceptions to the rule that ⟨ς⟩ is used word-finally and ⟨σ⟩ elsewhere, see Nick Nicholas's Sigma: final vs. non-final which is part of the Thesaurus Linguae Graecae project. — 2. The Fraktur smart font I most often use is UnifrakturMaguntia. Its glyph for U+0041 A LATIN CAPITAL LETTER A is, of course, similar to 𝔄. Love —LiliCharlie (talk) 10:47, 22 February 2019 (UTC)
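The round-trip argument can be illustrated with a minimal sketch. It assumes ISO 8859-7 as an example of a pre-Unicode encoding (the comment above does not name one) and uses Python's built-in `iso8859_7` codec; `round_trips` is a hypothetical helper name.

```python
import unicodedata

FINAL_SIGMA, SIGMA = "\u03C2", "\u03C3"


def round_trips(ch: str, codec: str = "iso8859_7") -> bool:
    """True if encoding to the legacy codec and back returns the same character."""
    return ch.encode(codec).decode(codec) == ch


for ch in (FINAL_SIGMA, SIGMA):
    legacy = ch.encode("iso8859_7")
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> 0x{legacy.hex().upper()} "
          f"round_trips={round_trips(ch)}")

# Python's str.lower() picks the final form by position when lowercasing a
# capital sigma, which is the word-final rule mentioned above.
print("ΟΔΟΣ".lower())
```

Had the two forms been unified into one code point, the legacy bytes for one of them could not have been restored after conversion to Unicode and back.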

@Prosfilaes:
I don't see how I'm adding confusion by saying "For code points" and then saying "the character ÷". Saying "For code points" and then saying "the character LATIN CAPITAL LETTER X" is no less guilty of confusing code points with characters. The English letter string 'LATIN CAPITAL LETTER X' is neither a code point nor a character, nor is the glyph '÷'. Both only designate characters. '÷' has the advantage that it does not presuppose any familiarity with Latin or any other well-known script. Further, any reader who is familiar with Latin would take exception to "LATIN CAPITAL LETTER W", an official Unicode name, since Latin did not have a W. Better just to refer to "the capital letter W".
You write:
The fact that names are capitalized is something that you learn about Unicode by exposure, and again, for the audience, is something they'll just absorb.
This is hardly necessary. An encyclopedia is supposed to tell the reader things, not just expose them to usages. Even if this information is added to the article, though, "the character LATIN CAPITAL LETTER X" will strike the reader—strikes me, anyhow—as odd, since a letter string is not a character. Referring to "the English character X", (thereby distinguishing it from the Greek character Χ) would be much better.
Yes, one expects control characters, but why not something with a name familiar to the typical reader like the carriage return U+000D?
As you say, a hieroglyph would be preferable to 𣥜.
@LiliCharlie: Point taken.
Peter Brown (talk) 19:19, 22 February 2019 (UTC)
Thousands of Wikipedia articles refer to Unicode characters by their official names in capitalized form. The reason for this is that the names are unique and normatively identify the character referred to. If we were to abandon the official Unicode character names and devise our own names (which would be original research) then there would be endless disputes about the names. You prefer to refer to "X" as "English character X" yet you must know that X is used for hundreds of other languages, so referring to "X" as an "English character" would be totally unacceptable — which is why LATIN CAPTIAL LETTER X is so much better way of referring to the character. BabelStone (talk) 21:42, 22 February 2019 (UTC)

Why is Latin "so much better" than English? Granted, the English and Latin X is also the German and Swedish X, but we need to apply some adjective—Latin, English, German, whatever—to distinguish it from the Greek Χ, which really is a different character. In en.wikipedia.org, the character can be clearly designated as the "English character X". In sv.wikipedia.org, it would be clearer to call it the "Svenska bokstaven X". Neither is "totally unacceptable".

Choosing a locution maximally clear to the expected reader is not original research. It is not research at all. Even misspelling "capital", as you did above, engenders no problem—we all know what you meant.

Peter Brown (talk) 23:36, 22 February 2019 (UTC)

Latin, especially LATIN, is much better than English, because the English character X seems to label something English-specific, whereas Latin is more likely to be taken as referring to Latin script; even if you're not familiar with that phrase, most people should recognize Latin is the ancestor of our script and take it as generic.
I think the question comes down to learning styles, and while I'm not sure mine is better, I do think it's more encyclopedic to separate levels and talk here about the code-point level and how you write code points, like U+0050, without trying to drag in what the code points mean here. --Prosfilaes (talk) 06:01, 23 February 2019 (UTC)
This must be a joke. While there are letters of the English alphabet (≈Latin letters regularly used in English) and punctuation marks regularly used in English, there is nothing like an "English X", a "Commonwealth English Æ" (as in encyclopædia) or an "English full stop/English period." The ⟨X⟩ in “Xi'an is beautiful.” is neither a "Chinese Pinyin X" nor an "English X"; it's just the Latin script capital letter X that is a common element of the English, the Chinese Pinyin, the Latin, and many other writing systems. Love —LiliCharlie (talk) 13:34, 23 February 2019 (UTC)

Once again, the wording in question has read:

For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used, as required (e.g., U+E0001 for the character LANGUAGE TAG and U+10FFFD for the character PRIVATE USE CHARACTER-10FFFD).

This violates MOS:ALLCAPS, according to which one should use capital letters for Unicode names only "when presenting tables of Unicode data, and when discussing code point names as such. Otherwise prefer unstyled, plain-English character names". The passage in question is a discussion of the designation of code points in the 'U+' format, not of code point names as such.

Adopting Prosfilaes' suggestion that a hieroglyph be used and acknowledging LiliCharlie's objection to "the English X", I am bringing the passage into accord with the MOS by replacing it with

For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g. U+0058 for the character 'X' in English and related languages); for code points outside the BMP, five or six digits are used, as required (e.g. U+13254 for the Egyptian hieroglyph '𓉔').

Peter Brown (talk) 19:06, 24 February 2019 (UTC)

In full, the bullet point in MOS:ALLCAPS relevant to Unicode reads:

The names of Unicode code points are conventionally given in small caps (tip: enter the name in all caps into the template {{sc2}}). Example: the character ⁓ (U+2053, SWUNG DASH). This is only done when presenting tables of Unicode data, and when discussing code point names as such. Otherwise prefer unstyled, plain-English character names (whether they coincide with code point names or not): the hyphen and the en dash, not the HYPHEN-MINUS and the EN DASH.

The Unicode article currently contains the text:

For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X)

This is contrary to MOS, as the discussion contains the all-caps text LATIN CAPITAL LETTER but does not present a table and is not about code point names as such but rather about a standard way of designating code points, one that involves hexadecimal digits.

I replaced this with text that does conform to MOS. LiliCharlie has reverted it, restoring the nonconforming code. Though the associated edit summary correctly quotes MOS:ALLCAPS as saying "The names of Unicode code points are conventionally given in small caps", the convention in question is provided by the Unicode Standard and the MOS spells out a different convention to be followed in Wikipedia articles. With some clearly-specified exceptions, we are forbidden to use code-point names in the manner prescribed by the Unicode Standard. Rather, we are instructed to use plain-English character names whether they coincide with code point names or not. Editors are welcome to improve on the phrase I used, "the character 'X' in English and related languages", perhaps referring to the Latin ancestry of the character, but such emendations should still conform to the MOS.

Peter Brown (talk) 21:47, 24 February 2019 (UTC)

That's clearly wrong, or at best confusing, since the hyphen-minus and the hyphen are two totally different things. In any case, we should not bring in plain English character names because we're not talking about plain English characters. If necessary, I'm fine with removing the names altogether; they're not needed for the example.--Prosfilaes (talk) 01:34, 25 February 2019 (UTC)
Well, we can use plain English referring expressions, can't we, even if we don't call them "names"? And the reader would surely appreciate seeing glyphs to get some idea what we're talking about; these could be put in parentheses. How about the following?
For code points in the Basic Multilingual Plane (BMP), four digits are used, e.g. U+00F7 for the division sign (÷); for code points outside the BMP, five or six digits are used as required, e.g. U+13254 for the Egyptian hieroglyph designating a winding wall (𓉔).

Peter Brown (talk) 19:49, 26 February 2019 (UTC)

Version 12.1: new Japanese era name (2019-05-01)

Version 12.1 adds U+32FF SQUARE ERA NAME REIWA "to enable software to be rapidly updated to support the new Japanese era name in calendrical systems and date formatting. The new Japanese era name was officially announced on April 1, 2019, and is effective as of May 1, 2019." [1] -DePiep (talk) 22:08, 9 July 2019 (UTC)

New: Unicode alias names and abbreviations

FYI: I have created article Unicode alias names and abbreviations. IMO it is very complete, both wrt formal aliases (the 5 reasons), and the informal ones (used by Unicode e.g. in charts, but not formalised/listed). Personally: I have been working on this a long time to get it right (in age not time spent ;-) ). Mainly to get a complete list of abbr's-in-unicode. -DePiep (talk) 00:57, 19 October 2019 (UTC)

What is a Unicode font?

Google search has already picked this up so lest skim readers be misled I have stricken out proposals that I have withdrawn. --Red King (talk) 11:24, 15 October 2019 (UTC) @Peter M. Brown:, @LiberatorG:, @Drmccreedy:

I wrote

The term 'Unicode font' is used more specifically to categorise those fonts that have implementations for every character in the repertoire (or at least a large (65,535) subset of it).

I am very open to suggestions about how better to say this. First of all, I take it that my "Unicode is not in principle concerned with fonts per se, seeing them as implementation choices. Any given character may have many allographs, from the more common bold, italic and base letterforms to complex decorative styles." is not disputed.

A font (according to that article) "was a particular size, weight and style of a typeface in hot-metal typesetting" and in modern terminology can be taken as synonymous with a typeface. So, it seems to me, a "Unicode font" is literally a contradiction in terms: Unicode is a database of numeric values associated with letterforms no matter how drawn, it is not a font or a typeface. The font expresses in vectors how an artist (typographer) has chosen to draw a letterform associated with that number. But the industry has adopted the term "Unicode font" to mean a font that has at least most of the characters in the basic plane. Is there a better way to express that succinctly than my proposal? --Red King (talk) 23:21, 11 October 2019 (UTC)

No, the "industry" (whatever that is) has not adopted the term "Unicode font" to mean a font that has at least most of the characters in the basic plane. A Unicode font is simply a computer font that has a cmap table that maps glyph IDs to Unicode code points. The term "pan-Unicode" is sometimes used to refer to a Unicode font that covers a high proportion of Unicode characters, but given that TTF/OTF fonts have a maximum limit of 65,535 glyphs, and Unicode 12.1 defines 137,766 graphic characters, it is nigh on impossible for a font to cover all Unicode characters. BabelStone (talk) 23:32, 11 October 2019 (UTC)
I had hoped that your reference to Cmap would point the way to an improved definition, but unfortunately it just takes us around in a circular argument. "defines the mapping of character codes to the glyph index values used in the font."[1] I accept that my definition is uncited (though I have great difficulty accepting a font with less than 2000 glyphs as a "Unicode font" - just because it calls itself a Unicode font surely doesn't make it one). Is there an RS definition anywhere?--Red King (talk) 23:41, 11 October 2019 (UTC)
  1. ^ "cmap – Character To Glyph Index Mapping Table – Typography". docs.microsoft.com.
It is not something as basic as a font that supports multi-byte representation (rather than single byte), is it? (8-bit bytes)--Red King (talk) 00:04, 12 October 2019 (UTC)
A Unicode font could mean a lot of things; why wouldn't a font that supports glyphs indexed by Unicode point be called a Unicode font? Certainly if we were talking about a script supported by fonts that put their glyphs over the ASCII range, a font that used instead Unicode code points could productively be called a Unicode font. I certainly wouldn't talk about number of glyphs; a GB 2312 font with over 7,000 characters is less of a Unicode font than a Latin-Greek-Cyrillic font that richly supports all features only Unicode offers.--Prosfilaes (talk) 10:38, 12 October 2019 (UTC)
I think you've fallen into the same trap as I did, that it is not a "proper" Unicode font unless it has a substantial repertoire. That is a qualitative judgement rather than a functionality judgement. [I have accepted that my original text was incorrect, btw. But something like it needs to be said, the question now is what?] According to the (uncited!) statement that opens the Unicode fonts article,

A Unicode font is a computer font that maps glyphs to Unicode characters (i.e. the glyphs in the font can be accessed using code points defined in the Unicode Standard).

So if that is literally correct, a font that has no more than the most basic ASCII character-set would qualify as a Unicode font if "the glyphs in the font can be accessed using code points defined in the Unicode Standard". Is this true? Why? Why not? --Red King (talk) 10:51, 12 October 2019 (UTC)
Before Unicode we used fonts with a maximum of <256 glyphs. When we mixed scripts we had one font for Latin, another one for Cyrillic, yet another one for Greek, one for Hebrew, etc. (The encodings used are now known as "legacy encodings.") So different characters were mapped to the same code point, and formatting was essential for a text to make sense. (However my word processor sometimes "lost formatting," which resulted in gibberish, or mojibake.) Unicode was revolutionary as formatting was no longer needed for texts to make sense (see article Plain text), and we actually called any font containing a table to map glyphs to Unicode code points a Unicode font. This even applied to Latin-only fonts, because it was not clear which script a font handled, and even if a font was known to be a "Latin font," the range above (<128-character) ASCII was still legacy-encoded, so non-basic Latin letters like the ⟨é⟩ in résumé were mapped to different code points by different software developers. Love —LiliCharlie (talk) 14:34, 12 October 2019 (UTC)
Yes, I knew that, see ISO Latin-1, ISO Latin 2 etc. I'm afraid this all reminds me of the HD Ready scam. So let me be provocative and propose a new draft second sentence. For convenience, I'll open a new subsection. --Red King (talk) 15:05, 12 October 2019 (UTC)
Not necessarily substantial, but something that's uniquely Unicode.
You're imagining that Unicode font means something clear and unambiguous. It doesn't. Certainly, a font for a script might be called a Unicode font if there were a previous tradition of non-Unicode-compatible fonts, even if it only covered a 7-bit charset. But Unicode font would generally only be a marketing term.--Prosfilaes (talk) 05:53, 13 October 2019 (UTC)
But Unicode font would generally only be a marketing term. Precisely and that is what I believe that the article should say explicitly. Caveat emptor presupposes an informed emptor. As well as the article telling readers what Unicode is, it should also tell them what it is not. --Red King (talk) 13:19, 13 October 2019 (UTC)
I don't see any need to change the current section at all. Saying what something is not is rarely very productive; Unicode is not a fox, it's not a box, it is not rain, it is not a train, etc.--Prosfilaes (talk) 13:28, 13 October 2019 (UTC)
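The sense of "Unicode font" discussed in this thread (glyphs reachable through Unicode code points, with no minimum repertoire) can be pictured with a toy lookup table. The sketch below is not a real font API: the dictionary `toy_cmap`, its glyph names, and both helper functions are invented for illustration, and an actual cmap table maps character codes to glyph index values rather than names.

```python
# Toy stand-in for a font's cmap table: code point -> glyph name.
# The repertoire and the glyph names are made up for this example.
toy_cmap = {
    0x0041: "A",
    0x0042: "B",
    0x00E9: "eacute",
}


def glyph_for(char, cmap):
    """Return the glyph name mapped to a character, or None if it is missing."""
    return cmap.get(ord(char))


def missing(text, cmap):
    """List the characters of `text` that the toy font has no glyph for."""
    return [c for c in text if ord(c) not in cmap]


print(glyph_for("\u00e9", toy_cmap))  # looked up by its Unicode code point
print(missing("ABC", toy_cmap))       # characters outside the small repertoire
```

Under this reading, a three-glyph table is addressed by Unicode code points just as a much larger one would be; the size of the repertoire is a separate question from how glyphs are accessed.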

Suggested draft second sentence

For a font to be described legitimately as a "Unicode font", it is only required that the glyphs in the font can be accessed using code points defined in the Unicode Standard. There is no minimum number of characters that must be included in the font; some fonts have quite a small repertoire.

Better? --Red King (talk) 15:05, 12 October 2019 (UTC)

That's my understanding but I'd like to see a citation to back it up. DRMcCreedy (talk) 15:40, 12 October 2019 (UTC)
Yes, absolutely, I agree, indeed I would have to say it is a prerequisite before it can be added but I've not yet found one. I've based it closely on the opening sentence of Unicode font, but it is not cited there either.
I don't think we should tell users what "a font to be described legitimately as a 'Unicode font'" is. The term "Unicode font" has been used in various senses, and it is not up to Wikipedia users to instruct other users which usage is the legitimate one, classifying any other usage as illegitimate. We should report what experts agree on, not define new standards of legitimacy. Love —LiliCharlie (talk) 22:39, 12 October 2019 (UTC)
That's what I was thinking but couldn't articulate. DRMcCreedy (talk) 23:32, 12 October 2019 (UTC)
Yes, I accept that. I guess "validly" has the same issue. What I am trying to put encyclopedically is that any font, no matter how small its repertoire, qualifies as a unicode font if it meets the technical specification. But the need for citation increases if anything. --Red King (talk) 13:12, 13 October 2019 (UTC)
Version by version, an increasing number of characters are assigned code points in the Unicode Standard. On the proposed definition, a non-Unicode font, even an obsolete one supporting nothing in the BMP, can become a Unicode font without any reworking of the font. Is that an acceptable consequence? Peter Brown (talk) 16:30, 12 October 2019 (UTC)
I share your sentiment but I can't see any reasonable basis to exclude them unless we find a citation that says yea or nay. --Red King (talk) 22:05, 12 October 2019 (UTC)

Possible citations

Bigelow and Holmes

Is this acceptable as a suitable citation?: To call an incomplete font containing Unicode subsets a ‘Unicode’ font could be misleading, since some users could mistakenly assume that any font called ‘Unicode’ will contain a full set of 28,000 characters.[1] The problem is that the paper is about designing Lucida Sans Unicode and describes the authors' attempt to create a font that has a consistent style irrespective of alphabet. Other articles I have read are quite disparaging about that idea, saying that there are major cultural differences between writing systems. These writers promote the idea that it is better to have a font that is optimised for the language in which a document is written, that it doesn't matter if it is "incomplete". I conclude also that while a 'pan-Unicode' [deprecated term!] font might have been a credible objective with 28000 characters in 1993, surely it is no longer so? So this citation might fail NPOV even if it passes WP:VNT.

Unicode Consortium FAQ

Interestingly, the Consortium prefers the phrase "Unicode conformant font". A Unicode-conformant font can be defined as a font which contains a mapping from Unicode characters and that maps characters to glyphs in a way that is consistent with character semantics defined in the Unicode Standard.[2] Note that the FAQ says nothing about comprehensiveness. (I propose to append this citation to the opening sentence of Unicode fonts).

Further comments (and citations!) welcome. --Red King (talk) 16:34, 13 October 2019 (UTC)

References

  1. ^ Bigelow, Charles; Holmes, Kris (September 1993). "The design of a Unicode font" (PDF). Electronic Publishing. 6 (3): 289–305 (quotation from p. 298).
  2. ^ "Fonts and keyboards". Unicode Consortium. 28 June 2017. Retrieved 13 October 2019.

Second proposal for second sentence

Another attempt:

A font is "Unicode compliant" if the glyphs in the font can be accessed using code points defined in the Unicode Standard.[1] The standard does not specify a minimum number of characters that must be included in the font; some fonts have quite a small repertoire.

References

  1. ^ "Fonts and keyboards". Unicode Consortium. 28 June 2017. Retrieved 13 October 2019.

Note that FAQ uses the term "compliant", which has the intent of my earlier "legitimate" without the overtones. Better? --Red King (talk) 11:24, 15 October 2019 (UTC)

  • Comment. The wikilinks are somewhat strange. We don't need a link to Unicode Standard, as this is the article that deals with that topic. Our first link to code point currently seems to be in the Architecture and terminology section, though the term is used 12 times above that section.
The smallest Unicode compliant font of my private collection is Noto Sans Tagbanwa v. 2.000 which has 29 glyphs (including 5 control and zero-width characters, 2 spaces, 2 punctuation marks, and the dotted circle U+25CC) that can be accessed via 28 Unicode characters. That appears to be all that is needed to write the language of the homonymous ethnicity. Is it really that remarkable that some scripts require fewer glyphs and characters than a Unicode compliant font for English? In other words: Do we really need the second sentence? Love —LiliCharlie (talk) 01:58, 19 October 2019 (UTC)
True, I did those [links] for clarity in this talk page, they would come out before going live. Good observation that it had been used earlier but not wlinked, I hadn't noticed that. --Red King (talk) 20:48, 19 October 2019 (UTC)
I can agree to removing the second sentence here. There is no requirement in this, so no need to suggest it. -DePiep (talk) 12:28, 19 October 2019 (UTC)
If you look back at the earlier discussion, some editors fell into the trap (led, I admit, by me but I see from the journal article I cited that it is a common error) of believing that a font cannot be a Unicode font unless it has many thousands of glyphs. I really do believe that we should say this but if the consensus is that it breaks the X is not a fox or a box rule, then I will have to concede. --Red King (talk) 20:48, 19 October 2019 (UTC)
  • Oppose. A font can map glyphs to Unicode code points, but map the wrong glyphs (e.g. map a "B" glyph to U+0041), in which case the font is not compliant or conformant to the Unicode Standard. BabelStone (talk) 13:20, 19 October 2019 (UTC)
I'm sorry, I don't really understand what you are saying here, could you elaborate please? I believe that I have paraphrased the FAQ at Unicode.org: your challenge seems (to me!) to be saying that you could have a font that is compliant but not legitimate (sic!). Surely we don't have to get bogged down in the possibility that someone might design, let alone expect to sell, a font that complies with the letter of the standard but contravenes its spirit? Our purpose is to explain, not to write a criminal code. --Red King (talk) 20:48, 19 October 2019 (UTC)

"Unicode compliant font"

Enough. Per OP "what is a Unicode font" etcetera: that does not exist. OTOH, a Unicode compliant font is well defined. So that is what enwiki should say. The article (with page content) is Unicode compliant font. -DePiep (talk) 21:00, 19 October 2019 (UTC)

Per WP:Common name, "Unicode font" is the term most widely seen. As of 20 October 2019, Wikipedia doesn't even have an article called "Unicode compliant font". This article should refer readers to Unicode font (or alias) for more detailed information, but it needs at least two sentences to give a reason why they should do that. Which is what all of the above is about. --Red King (talk) 13:39, 20 October 2019 (UTC)

Definitions

The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
"The discussion is over. You have anything to say, please add new section"

This section primarily discusses this edit made by Peter M. Brown (talk · contribs)
The versions of the section being discussed (for comparison):
Please use {{re}} to notify participants
Please use {{re|Alexander_Davronov|label1=<span><span style='color:#a8a8a8'>DAVRONOV</span><span style="color:#000">A.A.</span></span>}} to notify @DAVRONOVA.A.:

@Peter M. Brown: I'm going to revert back the changes you've made recently and just wanted to know your objections.
@Peter M. Brown: ... to conform more closely to the Unicode Standard Previous version (version1) repeated terms almost word by word. Now it doesn't.
@Peter M. Brown: ... Deleted the inclusion of a long quote in a reference It is much more convenient to have it directly inside the article. No reason to delete it.
@Peter M. Brown: ... Only one reference is needed per paragraph. Wiki doesn't impose restrictions on number of sources. If they are reliable removing them is generally bad idea.
@Peter M. Brown: ... "hex" is nonstandard There are some recommendations on it: MOS:RADIX. I'm going to replace it with 16 or just add 0x before the number. DAVRONOVA.A. 21:42, 16 November 2019 (UTC)


@DAVRONOVA.A.: I particularly agree with your third point. If they are non-redundant and reliable sources, having more than one source per claim is a good thing. BernardoSulzbach (talk) 16:18, 17 November 2019 (UTC)
Could someone explain how this {{re}} template is supposed to work? DAVRONOVA.A. provided a list of objections to an edit of mine in a post yesterday, and I just now happened on it, being mildly curious about the new section in Talk:Unicode. I’m more than willing to respond and will try to do so by 00:00, 19 November (UTC), but there was no notice in my "alerts", on my user talk page, or in my emails. Am I expected to look at everything that appears on my Watchlist? Peter Brown (talk) 22:30, 17 November 2019 (UTC)
@Peter M. Brown: It might have to do with your notification settings in Special:Preferences. BernardoSulzbach (talk) 01:23, 18 November 2019 (UTC)
Many thanks! I didn't know about that page. I've now checked "Notify me when someone links to my user page" and "email". That should work. Peter Brown (talk) 03:21, 18 November 2019 (UTC)
Taking up the objections in turn:
  • The previous version repeated terms almost word by word. Now it doesn't.
Almost, yes, but the differences are important. According to the previous version,
Unicode defines a codespace – set of numbers/integers used to encode characters in the range of 0 to 10FFFFhex.
The reference is to Unicode's "Glossary", according to which a codespace is "A range of numerical values available for encoding characters." The text in my revised version is word-for-word identical. The "Glossary" does go on to note that for the Unicode Standard the range is 0 to 10FFFF₁₆. The previous version of the Wikipedia subsection omits this qualification, incorrectly implying that in general a codespace, regardless of the encoding standard, has this as a range. In the next sentence of my revised version, I do specify the range for Unicode.
  • It is much more convenient to have it [the long quote incorporated in a footnote] directly inside the article. No reason to delete it.
The quote in question was not directly inside the article but rather in a footnote. According to WP:FOOTNOTES, footnotes have two purposes: documenting an article's sources and providing tangential information. For the first purpose, the citation is sufficient without the quotation. For the second, the information in the quote is mostly not "tangential" as it repeats information in the main text.
The only information in the footnote that is not in the main text is a definition of an "encoded character". If this phrase needs to be defined, which I doubt, putting the definition in the Architecture and terminology section is too late, as the phrase has already been used three times, including a use in the lead section. I agree with the author of the lead that the average reader does not need a definition. Anyhow, the article often refers to code points, text, and scripts, rather than characters, as being encoded.
  • Wiki doesn't impose restrictions on number of sources. If they are reliable, removing them is a bad idea.
Agreed. The paragraph has two sources, both from the Unicode Standard, the "Glossary" and the section "Code Points and Characters". This remains true.
  • Response to "hex is nonstandard."
This is uncontroversial. I have replaced "10FFFFhex" with "hexadecimal 10FFFF", which will be clear to many readers. Other readers can follow the link to hexadecimal. According to MOS:RADIX, the use of subscripts for numerals not in base 10 is limited to articles that are not computer oriented. If an editor uses prefixes such as 0x then, per the same MOS subsection, the editor must "Explain these prefixes in the article's introduction or on first use." In any case, the previous version of the Wikipedia article requires modification.
Peter Brown (talk) 19:42, 18 November 2019 (UTC)
@Peter M. Brown: Sorry for a belated response. I've added three versions of the section being discussed so we may compare'em easily.
[...] incorrectly implying that in general a codespace, regardless of the encoding standard [...] It explicitly refers to the standard and glossary of Unicode so No, it doesn't justify changes. My version(1) of codebase of definition was shorter. Do you agree to replace words «set of numbers/integers» by «range of numbers» in my version and leave it in place? As well as definition of code points?
[...] citation is sufficient without the quotation. [...] as it repeats information in the main text. [...] This part is elaborated more precisely by WP:CS and WP:CLOP, not WP:FOOTNOTES. The quotation seems to me advantageous since it covers all three definitions and may be placed at three different places simultaneously or at the end and, as I said, it's quick to access. If you insist that the quotation is excessive I would concede.
[...] putting the definition ... is too late, as the phrase has already been used three times [...] This is unreasonable & subjective. It's never late. It should be given for the sake of clarity. I insist on returning it, as a reliable source is given.
[...] This remains true. [...] Let's return it back by the end of conversation.
[...] previous version of the article requires modification I suggest to put it this way: «0₁₆ to 10FFFF₁₆». Any thoughts? DAVRONOVA.A. 21:59, 23 November 2019 (UTC)
@DAVRONOVA.A.:
I assume that you're referring to a codespace, though you wrote "codebase".
  • I have no preference whatever between "set of numbers/integers" and "range of numbers". What is at issue is substantive fidelity to the Unicode Glossary, which has two entries:
(a) A range of numerical values available for encoding characters
(b) For the Unicode Standard, a range of integers from 0 to 10FFFF₁₆
Your version 1 has
(c) set of numbers/integers used to encode characters in the range of 0hex to 10FFFFhex
Why does the Glossary have two definitions? Because (a) is a general definition covering Unicode, ASCII, EBCDIC, GB 18030, etc. etc. Each of these has a different range of numbers and hence a different codespace. In ASCII, for example, it's 0 to 255. And in Unicode? The answer is provided in (b): for this particular encoding, the range is 0 to 10FFFF₁₆ i.e., in decimal, 0 to 1114111. (c) provides a general statement, at odds with (a): with no specification of the encoding, the range is said to be 0 to 10FFFF₁₆. Of course (c) is shorter than (a)+(b), because it does less: (a)+(b) provides both a general definition of a codespace and a specification of the codespace for Unicode.
  • WP:FOOTNOTES vs. WP:CS: You're right, WP:CS is an official guideline and WP:FOOTNOTES, which flatly contradicts it, is not. I get impatient when I follow a footnote and find the text repeated; "I've already read that," I think. But rules are rules, even when I don't like them.
  • ...the phrase has already been used ... in the lead. I think this is a legitimate complaint. Unless following a section link, everybody reads the lead first, so, if a locution is obscure enough to require a definition at all, it shouldn't be undefined in the lead. Put it back in Architecture and terminology if you must, but also define this putatively obscure locution in the lead.
  • Let's return it back by the end of conversation. Hunh? Return what back where by when?
  • 0₁₆ to 10FFFF₁₆ The text in the Glossary has "0 to 10FFFF₁₆", as subscripting 0 is not called for; zero is zero. My text "0 to hexadecimal 10FFFF" provides a link for a person unfamiliar with "hexadecimal" and with the subscript notation, but perhaps no such person would be reading this article. Go with the subscript if you feel strongly about it.
Peter Brown (talk) 01:20, 24 November 2019 (UTC)
@Peter M. Brown:
[...] I assume that you're referring to a codespace [...] Yea, I'm referring to a codespace of course. It was a mistake.
[...] Why does the Glossary have two definitions? Because (a) is a general definition covering [...] I think I got your lengthy explanation. I just discovered that a more precise definition of codespace exists[1] so I suggest to use the following version (I will remove quotations):

Unicode defines a unicode codespace[note 1] – a range of integers from 0 to 10FFFF₁₆.[2][3][1] Any value in the codespace is called a code point. Not all code points are assigned to encoded characters.[2]

—  Draft 1
[...] Hunh? Return what back where by when? [...] I'm going to return back citation you have removed once we come to a consensus over definitions' shape.
[...] Go with the subscript if you feel strongly about it. [...] We also may utilize <ref group="...">... but I think subscripts with linked type of numbers is the best choice so let's go with it. DAVRONOVA.A. 12:49, 24 November 2019 (UTC)
@DAVRONOVA.A.:
I think that we are agreed. To summarize: the Unicode Standard[1], as you have noted, characterizes the Unicode codespace as
A range of integers from 0 to 10FFFF₁₆.
The sentence I objected to read
Unicode defines a codespace – set of numbers/integers used to encode characters in the range of 0 to 10FFFFhex.
These are not the same. The first is explicitly a characterization of a Unicode codespace. The second is a general characterization of a codespace and it errs because other encodings, ASCII for example, have different codespaces. Your proposal to replace the wording with "Unicode defines a unicode codespace..." will correct matters. Unicode nowhere defines a codespace in such a way as to exclude other ranges for other encodings, but that's just what the sentence I reverted claims that Unicode does.
Since you have not responded to my point that a definition of "encoded character" should not appear in the section Architecture and terminology unless it appears in the lead, I take it that you agree and will alter the lead so that it either does not use the locution "encoded Unicode character" or else uses the locution along with a definition.
Peter Brown (talk) 21:30, 24 November 2019 (UTC)
I just ran across MOS:NOTES. Don't [[MOS:]] sections have priority over [[WP:]] ones, hence MOS:NOTES supersedes WP:CS? Since the long quote I deleted falls into none of the four categories allowed under MOS:NOTES, it seems that my deletion was in order. Peter Brown (talk) 16:19, 27 November 2019 (UTC)
@Peter M. Brown:
So do you have any objections regarding my draft proposed here? If so, let me know. We need agreement to proceed.
[...] Since you have not responded to my point that [...] I've answered it here.
[...] I take it that you agree [...] Do not take anything as agreement until I explicitly express it. Addition of definition of encoded character wouldn't decrease article's quality I'm sure.
[...] Don't [[MOS:]] sections have priority over [...] It depends on whether it's a policy or guideline. Both (NOTES & CS) are guidelines and I consider them equal. MOS:NOTES doesn't override WP:CS cause they govern different parts of the article: appearance and structure of footnotes (MOS:NOTES) and its content (WP:CS) respectively. DAVRONOVA.A. 21:06, 27 November 2019 (UTC)

Definitions 2

The changes being discussed:

@Peter M. Brown: I suggest to return back to old definition of codespace with additional sourcing. After this discussion I started to think that it's more concise and accurate. Any objections? DAVRONOVA.A. 22:51, 5 December 2019 (UTC)

Unicode defines a codespace of 1,114,112 code points in the range 0 to 10FFFF₁₆.[2][3][1]

— New proposal based on version0
As it stands, the first sentence of the section contains, literally, the definition of "codespace" in the Glossary. One cannot be any truer to the sources than that. Now that the reader has got that far, it is necessary to be more specific as to what Unicode's codespace is. The second sentence does this, paraphrasing the second sentence in the glossary; I am uncomfortable with actually using that second sentence, which starts "For the Unicode Standard ...", since the phrase "Unicode Standard" appears as a proper name (with a capital 'S', no less), and the reader has not been introduced to this usage. That done, proceeding step by step, the third sentence explains the phrase "code point".
Do I understand you as proposing to start out with a sentence using "code point" without defining it? If so, I disagree. It is linked but, per MOS:LINKSTYLE, "as far as possible do not force a reader to use [a] link to understand the sentence." Version0 is even worse, introducing both "code point" and "codespace" without definitions. Admittedly, this is also done in the lead; I maintain that this also needs correction, but one thing at a time.
Peter Brown (talk) 00:50, 6 December 2019 (UTC)
@Peter M. Brown: Ok, let's leave definition of codespace unchanged.
I was going to ask you to amend your sentence added by this edit: «Not all of these 1,114,112 code points are available for encoding visible characters»; amount of code points isn't mentioned before so word these is unexpected here. DAVRONOVA.A. 22:41, 6 December 2019 (UTC)
Suppose we change the second through fourth sentences to
For Unicode, the relevant codespace consists of 1,114,112 numbers, all the integers from 0 to 10FFFF₁₆. Each of these is called a code point. Not all of them are available for encoding visible characters; some, for example, are assigned to control codes like the carriage return.
OK? You handle the <ref>s, please—I'm likely to mess up again.
Peter Brown (talk) 23:48, 6 December 2019 (UTC)
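The figures in the proposed wording can be checked with a short Python sketch. It relies on the standard-library `unicodedata` module and on `sys.maxunicode`; the variable names are chosen here for illustration only.

```python
import sys
import unicodedata

# Upper bound of the Unicode codespace, written in hexadecimal.
CODESPACE_MAX = 0x10FFFF

# The codespace runs from 0 to CODESPACE_MAX inclusive.
codespace_size = CODESPACE_MAX + 1
print(f"{codespace_size:,}")

# Current Python builds use the same upper bound for chr().
print(sys.maxunicode == CODESPACE_MAX)

# The carriage return is a code point assigned to a control code,
# not to a visible character; its general category is "Cc".
cr_category = unicodedata.category("\r")
print(cr_category)
```

The count is one more than the upper bound because zero is itself a code point, which is why 10FFFF hexadecimal (1,114,111 decimal) yields 1,114,112 values.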
@Peter M. Brown: It's much better for second through fourth parts.
The second sentence does this, paraphrasing the second sentence in the glossary [...] Well I have to revoke my previous agreement over here: the current definition of codespace is clunky once again. It's unnecessary to cite general definition ("characterization") as it's obvious what codespace means regardless of type of encoding. I've opened an RfC to see whose point prevails over definitions of both code space and code points. DAVRONOVA.A. 16:27, 7 December 2019 (UTC)

References

  1. ^ In the article it is referred to simply as codespace.

References

  1. ^ a b c d "The Unicode Standard, Version 12.0" (PDF). p. 19. Unicode codespace: A range of integers from 0 to 10FFFF₁₆.
    • This particular range is defined for the codespace in the Unicode Standard.
    Other character encoding standards may use other codespaces.
  2. ^ a b c "Glossary of Unicode Terms". Retrieved 2010-03-16.
  3. ^ a b "2.4 Code Points and Characters". The Unicode® Standard Version 12.0 – Core Specification (PDF). 2019. p. 29. The range of integers used to code the abstract characters is called the codespace. A particular integer in this set is called a code point. When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character.

RfC: Which version you like the most?

There is a clear consensus for Version 0.

Cunard (talk) 10:29, 26 January 2020 (UTC)

The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

This RfC primarily concerns terms and definitions of the Unicode standard. Which version of the definitions of «codespace» and «code point» do you like the most?

Please, take a note that counting starts from zero and revisions are listed in chronological order. The discussion may be found here: #Definitions 2. Any of three versions going to have at least 3 sources. DAVRONOVA.A. 16:27, 7 December 2019 (UTC)

Version 0 please.Spitzak (talk) 16:42, 7 December 2019 (UTC)
Version 0. Concise, easy to understand, and not loaded with redundant references. BabelStone (talk) 23:06, 7 December 2019 (UTC)
@BabelStone: Why do you think they are redundant? DAVRONOVA.A. 17:11, 9 December 2019 (UTC)
Version 0. Shmuel (Seymour J.) Metz Username:Chatul (talk) 21:08, 16 January 2020 (UTC)
The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.


Suggested article improvements

I feel like this article should be made more concise, and that maybe much of the material (e. g., the lengthy discussions of encoding schemes scattered throughout the article, particularly older schemes such as UCS-2 and UCS-4) might better be factored out into the corresponding, dedicated articles and replaced here by “See…” references to those articles. Your thoughts? —PowerPCG5 (talk) 02:48, 22 February 2020 (UTC)

If anything, it's the transforms that should be moved to a separate article. UCS-2 (essentially, the #Basic Multilingual Plane) and UCS-4 are part and parcel of Unicode. Shmuel (Seymour J.) Metz Username:Chatul (talk) 02:11, 23 February 2020 (UTC)

Category merge proposed

I have proposed to merge all version-specific subcategories like Category:Scripts encoded in Unicode 13.0 into Category:Scripts encoded in Unicode. Discussion is here. -DePiep (talk) 22:13, 20 March 2020 (UTC)

Special characters in article

A recent edit by User:Beland removed registered trademark symbols (®) and replaced various named character attributes, e.g., &thinsp;, with the actual characters or with ASCII quotation marks. Were there Wikipedia policies requiring that? Intuitively, it would seem that it would be easier to edit with named attributes and that the registered symbol was legally required. Shmuel (Seymour J.) Metz Username:Chatul (talk) 12:35, 7 May 2020 (UTC)

The ® is definitely not legally required, and is proscribed by Wikipedia:Manual of Style/Trademarks. Conversion of "…" to "..." is required by MOS:ELLIPSIS. ASCII quote marks are required by MOS:STRAIGHT. MOS:MARKUP says in general markup should be kept as simple as possible; since it's usually unnecessary, I've generally been dropping &thinsp; or (as in this case) converting it to a regular ASCII space. It's generally agreed (MOS:NBSP) that when thin spaces are used, a named reference of some kind is preferred over the character itself, since it's difficult to tell apart from other whitespace characters. Adding space around special characters is one of the cases where thin spaces are explicitly allowed; if you prefer them over regular ASCII spaces in this case (or no space), feel free to restore them. I recommend using {{thinsp}}, since this is ignored by the automated scan I was using. -- Beland (talk) 15:04, 7 May 2020 (UTC)
The ® were removed from the titles of referenced documents. I thought the titles of references were to be kept unchanged as much as possible.
I also agree that replacing character references with invisible characters is a bad idea.Spitzak (talk) 21:13, 7 May 2020 (UTC)
About the ®: MOS:TM, already referred to, describes that we should use the independent sources style. On top of this, Unicode themselves prefer to omit the ® symbol: see Version references. -DePiep (talk) 21:33, 7 May 2020 (UTC)
About the ellipsis, the cited MOS:ELLIPSIS calls for the use of a non-breaking space before an ellipsis. Shmuel (Seymour J.) Metz Username:Chatul (talk) 21:45, 7 May 2020 (UTC)
My interpretation of MOS:TM is that reliable, independent sources can be used to determine styling (like Ipad vs. iPad) but that ® and ™ are to be avoided except as needed for disambiguation. (Though sources independent of the trademark holder rarely use the trademark symbols.) MOS:CONFORMTITLE says that titles of works should be altered to conform to Wikipedia house style. -- Beland (talk) 12:12, 12 May 2020 (UTC)

Number of valid characters

The article reports 143,859 valid characters; a short Python script that runs chr(value) with all possible values of length 1-4 bytes will report 2,294,016 valid characters (and 4,309,516,288 (!) invalid characters) in the byte range. Why is there a factor of ~20 between the two figures? — Preceding unsigned comment added by 85.0.37.33 (talk) 18:44, 25 May 2020 (UTC)

No, that is not what the article reports. The article does have the text "there is a repertoire of 143,859 characters,"; note that the text neither uses the term valid nor refers to strings of 1-4 octets; it refers to characters that have been assigned code points in the range 0000–10FFFF by the Unicode Consortium. The only text that uses the term valid is distinguishing surrogate pairs from other code points. Shmuel (Seymour J.) Metz Username:Chatul (talk) 20:23, 25 May 2020 (UTC)
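To make the distinction in the reply above concrete, here is a minimal Python sketch that walks the code points 0000–10FFFF and tallies them by general category. It is only an illustration, not the script the original poster ran: the function name `tally_code_points` and the bucket names are hypothetical, and it assumes that the repertoire figure counts graphic and format characters only (excluding controls, private use, surrogates, and unassigned code points). The last bucket depends on the Unicode version bundled with the Python interpreter, so it will not necessarily equal 143,859.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF  # upper bound of the Unicode codespace


def tally_code_points():
    """Classify every code point in the codespace by general category."""
    counts = {"total": 0, "surrogate": 0, "private_use": 0,
              "unassigned": 0, "control": 0, "graphic_or_format": 0}
    for cp in range(MAX_CODE_POINT + 1):
        category = unicodedata.category(chr(cp))
        counts["total"] += 1
        if category == "Cs":
            counts["surrogate"] += 1
        elif category == "Co":
            counts["private_use"] += 1
        elif category == "Cn":
            counts["unassigned"] += 1  # also covers noncharacters
        elif category == "Cc":
            counts["control"] += 1
        else:
            # Assumption: this bucket approximates the "repertoire" figure.
            counts["graphic_or_format"] += 1
    return counts


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for name, value in tally_code_points().items():
        print(f"{name}: {value:,}")
```

Note that `chr()` takes a code point, not a byte sequence, and raises `ValueError` for any argument above 0x10FFFF, so the size of the codespace (1,114,112 code points) is an upper bound on what such a loop can count.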

"유니코드" listed at Redirects for discussion

A discussion is taking place to address the redirect 유니코드. The discussion will occur at Wikipedia:Redirects for discussion/Log/2021 January 1#유니코드 until a consensus is reached, and readers of this page are welcome to contribute to the discussion. Dominicmgm (talk) 23:35, 1 January 2021 (UTC)

History

Given what Unicode is, an important part of its history is its adoption by word processors (Word, OpenOffice, but notably not WordPerfect) and operating systems (Windows, Linux, ...) and fonts (TTF). As a practical matter for the end user, it didn't become available in 1988, but when they could use it for their documents (I think for most people this meant 1997). — Preceding unsigned comment added by 77.61.180.106 (talk) 00:29, 13 October 2021 (UTC)

Infobox Unicode block: add a 'related' list?

See discussion at Template talk:Infobox Unicode block § Related blocks. -DePiep (talk) 09:43, 27 February 2022 (UTC)

"Quivira (typeface)" listed at Redirects for discussion

An editor has identified a potential problem with the redirect Quivira (typeface) and has thus listed it for discussion. This discussion will occur at Wikipedia:Redirects for discussion/Log/2022 March 15#Quivira (typeface) until a consensus is reached, and readers of this page are welcome to contribute to the discussion. 1234 kb of .rar files (is this dangerous?) 19:02, 15 March 2022 (UTC)

Requested move 16 September 2021

The following is a closed discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. Editors desiring to contest the closing decision should consider a move review after discussing it on the closer's talk page. No further edits should be made to this discussion.

The result of the move request was: NOT MOVED: The consensus is that the current name properly describes the contents of this article and is not ambiguous. (non-admin closure) Spekkios (talk) 00:52, 25 September 2021 (UTC)


Unicode → Unicode Standard – The term "Unicode" is ambiguous, and may be used to refer to the Unicode Standard, the Unicode Consortium, Unicode characters, Unicode-encoded text, or any number of things related to the implementation of the Unicode Standard or the processing of Unicode text. The Unicode Consortium actively discourages the use of the term "Unicode" as an isolated noun ("Always use “Unicode” as an adjective followed by an appropriate noun. Do not use “Unicode” alone as a noun" Unicode Consortium Name and Trademark Usage Policy), and states that "The Unicode® Standard" should be used in preference to simply "Unicode" (of course we do not use ® on Wikipedia per MOS:TMRULES). The subject of this article is specifically the Unicode Standard (the opening sentence should be "The Unicode Standard is an information technology standard for ..."), and not the general concept of "Unicode", so the article should be moved to Unicode Standard, with Unicode left as a redirect to avoid having to rename thousands of wikilinks. BabelStone (talk) 16:30, 16 September 2021 (UTC)

  • Agree. DRMcCreedy (talk) 21:03, 16 September 2021 (UTC)
  • Oppose move. See WP:OFFICIALNAMES. We do not use a name simply because it is official, and the common name here is Unicode. O.N.R. (talk) 03:52, 17 September 2021 (UTC)
    • This does not address the ambiguity issue. "Unicode" is commonly used to refer to the Unicode Consortium. Just one random example from a BBC article: "Rachel Murphy and Amy Wiegand sent sample artwork to Unicode as part of their plea for a drone emoji", "Rachel Murphy thinks Unicode is wrong to not include a drone emoji", "Unicode rejected their proposal", etc. BabelStone (talk) 13:33, 17 September 2021 (UTC)
      • The article refers to the consortium as "the Unicode Consortium" on first reference. -- Calidum 15:40, 17 September 2021 (UTC)
      • As Calidum says. And even in its isolated form here, there is no misunderstanding in what is intended: "sent to the Unicode Consortium". How could this be misread? This obviousness is present throughout the article. No ambivalence. -DePiep (talk) 11:39, 18 September 2021 (UTC)
  • Generally oppose as per O.N.R. above. As far as I have ever seen, in general usage, "Unicode" as a bare noun is only used to refer to the standard. Even at UTC meetings with actual officers and members present, use of plain "Unicode" referred only to the standard, never the consortium, Unicode encoded text, characters, or anything else. However, I fully agree the article lede should begin with "The Unicode Standard" as the official name. VanIsaac, MPLL contWpWS 04:47, 17 September 2021 (UTC)
  • Oppose. We use common names, not official ones. -- Calidum 15:40, 17 September 2021 (UTC)
  • Support due to WP:PRECISE, not due to official names. That rationale is badly flawed, yes, but the precision one is very relevant. Red Slash 19:08, 17 September 2021 (UTC)
  • Oppose. First of all, possible ambiguity is limited to "[the] Unicode Standard" and "Unicode Consortium"; other terms mentioned in the proposal (Unicode characters, Unicode-encoded text, or any number of things ...) do not appear to be ambiguous. What is ambiguous in "Unicode characters"? Instead, it is fully self-explanatory! When a distinction between ...Standard and ...Consortium is needed, one should make it in the text. Also, a name like "Unicode CLDR" is never shortened to "Unicode", nor is any Unicode Technical Report name [3], so these do not apply.
Second, Unicode themselves use plain "Unicode" for the Standard throughout and consistently: see main TOC, Glossary. Except for self-referring situations, this leaves no misunderstanding (when self-reference could be confusing, one writes something like "The Unicode Standard is maintained by the Unicode Consortium"). No problem here.
On Wikipedia: As others have noted, WP:OFFICIALNAMES applies. Also, per WP:DISAMBIGUATION: we can easily establish that "Unicode Standard" is the primary topic for "Unicode". From there, we can create the article Unicode (disambiguation) (with two entries) and add the hatnote {{about}} to this article. Also, per WP:COMMONNAME, the current title is preferred and acceptable. -DePiep (talk) 12:08, 18 September 2021 (UTC)
  • Oppose. I disagree with "The subject of this article is specifically the Unicode Standard" - it is about a broad concept of Unicode characters, Unicode-encoded text, Unicode input systems ... basically anything other than the organization called Unicode Consortium. I'm not opposed to a new History of the Unicode Standard article which focuses specifically on information about the development of versions of the Unicode Standard. User:力 (power~enwiki, π, ν) 17:30, 21 September 2021 (UTC)
  • Oppose per DePiep and 力 (power~enwiki). Also, if we are to use it, I believe the Wikipedia guidelines for capitalization would indicate that "standard" should be in lowercase (regardless of whether the consortium uses lowercase or not). Wikipedia avoids unnecessary use of uppercase. —⁠ ⁠BarrelProof (talk) 00:21, 24 September 2021 (UTC)
The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

Vulnerabilities

A security advisory was recently released by two researchers, one affiliated with the University of Cambridge and the other with both Cambridge and the University of Edinburgh, in which they assert that carefully crafted computer source code can be used to introduce vulnerabilities into apparently harmless programs. Some security groups (such as the one for the Rust language) are already taking measures and issuing their own security advisories.

I think that is something that affects Unicode as source code is one of the main applications of the standard. What do ye think would be a good way to introduce that to the article?

[1] [2] [3]

Bruno Unna (talk) 12:02, 1 November 2021 (UTC)

Looks like existing Unicode § Issues is the place to go. Indeed, a true Unicode case (U+202E RIGHT-TO-LEFT OVERRIDE). Also consider mentioning at Bidirectional text? -DePiep (talk) 12:29, 1 November 2021 (UTC)
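As a rough illustration of the kind of check that can flag such source code, the following Python sketch scans text for the explicit bidirectional formatting characters (U+202A to U+202E, which includes RIGHT-TO-LEFT OVERRIDE, and U+2066 to U+2069). It is a minimal sketch under stated assumptions, not the method used in the advisory or by any particular compiler: the function name `find_bidi_controls` and the choice of which characters to flag are illustrative only.

```python
import unicodedata

# Explicit bidirectional formatting characters: embeddings, overrides,
# isolates, and their terminators.
BIDI_CONTROLS = frozenset(
    chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
)


def find_bidi_controls(source):
    """Return (line, column, code point, name) for each bidi control found.

    Line and column numbers are 1-based; columns count code points.
    """
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append(
                    (line_no, col, f"U+{ord(ch):04X}", unicodedata.name(ch))
                )
    return hits
```

A reviewer or pre-commit hook could run such a function over each source file and reject or highlight any file for which the returned list is non-empty.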

Hello M.h.gholamii (talk) 19:24, 14 July 2022 (UTC)

ko M.h.gholamii (talk) 19:24, 14 July 2022 (UTC)

question

I have been designing text shapes for electrical symbols and electronic elements. I design them in the FontCreator program with Unicode encoding, but after exporting the font and copying and pasting a symbol I designed into phone programs, it does not work and appears as a question mark. What is the solution? (Note: this topic is important for article development. I want to design symbols for non-electrical shapes too, not only in the field of electricity, and I want them to be text rather than thumbnails.) Mohmad Abdul sahib talk☎ talk 18:15, 18 April 2022 (UTC)

Likely, FontCreator has the appropriate font, containing the electric symbols, but the receiving program does not. It looks like the font should cover the Unicode block Miscellaneous Technical. This requires downloading a suitable font, but I cannot help any further. -DePiep (talk) 19:52, 14 July 2022 (UTC)

New Taskforce WikiProject Unicode?

A proposal has been opened at WP:COMP § Taskforce WP Unicode – proposal. Please take a look. DePiep (talk) 09:35, 2 October 2022 (UTC)

Version 15 & Wikidata

I am adding new blocks & data to Wikidata now. Assuming no DAB needed here, the pages are:

DePiep (talk) 16:10, 13 September 2022 (UTC)

QID added -DePiep (talk) 16:33, 13 September 2022 (UTC)
more listing -DePiep (talk) 18:02, 13 September 2022 (UTC)
Not much time to complete this list, for me. DePiep (talk) 18:12, 13 September 2022 (UTC)
  • Note that, as far as I can see, only two content articles require the "(Unicode block)" DAB-specifier, because of name overlap. The other "X (Unicode block)" pages should be redirects to their (unambiguously named) content Block article. See also {{Unicode blocks/overview}}. DePiep.