Talk:Unicode equivalence


Technical tone


The tone of this article is really technical. babbage (talk) 04:23, 4 October 2009 (UTC)[reply]


Is it OK to add a link to some software that I found useful? It's called charlint and it's a Perl script that can be used for normalisation. It can be found at http://www.w3.org/International/charlint/ Wrecktaste (talk) 15:54, 21 June 2010 (UTC)[reply]

Redirect


Glyph Composition / Decomposition redirects here, but the term glyph is not used in this article. — Christoph Päper 15:50, 27 August 2010 (UTC)[reply]

Subset


Mathematically speaking, the compatible forms are subsets of the canonical ones. But that sentence is a bit confusing and should probably be rewritten. 213.100.90.101 (talk) 16:36, 11 March 2011 (UTC)[reply]

Then please do so. I prefer a readable Unicode description. -DePiep (talk) 22:46, 11 March 2011 (UTC)[reply]
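
A minimal Python sketch of the subset relation under discussion, using the standard library's unicodedata module. The claim it illustrates: canonically equivalent strings are also compatibility equivalent, but not the other way round. The example characters are illustrative picks, not taken from the article.

    import unicodedata

    # Canonically equivalent pair: precomposed A-ring vs. A + COMBINING RING ABOVE.
    composed = "\u00C5"      # Å as a single code point
    decomposed = "A\u030A"   # A followed by the combining ring

    # Equal under the canonical forms...
    print(unicodedata.normalize("NFC", composed) ==
          unicodedata.normalize("NFC", decomposed))    # True
    # ...and therefore also equal under the compatibility forms.
    print(unicodedata.normalize("NFKC", composed) ==
          unicodedata.normalize("NFKC", decomposed))   # True

    # Compatibility-only pair: the "fi" ligature vs. plain "fi".
    ligature = "\uFB01"      # LATIN SMALL LIGATURE FI
    print(unicodedata.normalize("NFC", ligature) == "fi")    # False
    print(unicodedata.normalize("NFKC", ligature) == "fi")   # True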

Rationale for equivalence


The following rationale was offered for why Unicode introduced the concept of equivalence:

it was desirable that two different strings in an existing encoding would translate to two different strings when translated to Unicode, therefore if any popular encoding had two ways of encoding the same character, Unicode needed to as well.

AFAIK, this is only part of the story. The main problem (duplicated chars and composed/decomposed ambiguity) was not inherited from any single prior standard, but from the merging of multiple standards with overlapping character sets.
One of the reasons was the desire to incorporate several preexisting character sets while preserving their encodings as much as possible, to simplify the migration to Unicode. Thus, for example, the ISO Latin-1 set is included exactly in the first 256 code positions, and several other national standards (Russian, Greek, Arabic, etc.) were included as well. Some attempt was made to eliminate duplication; so, for example, European punctuation is encoded only once (mostly in the Latin-1 segment). Still, some duplicates remained, such as the ANGSTROM SIGN (originating from a set of miscellaneous symbols) and the LETTER A WITH RING ABOVE (from Latin-1).
Another reason was the necessary inclusion of combining diacritics: first, to allow for all possibly useful letter-accent combinations (such as the umlaut-n used by a certain rock band) without wasting an astronomical number of code points, and, second, because several preexisting standards used the decomposed form to represent accented letters.
Yet another reason was to preserve traditional encoding distinctions between typographic forms of certain letters, for example the superscript and subscript digits of Latin-1, the ligatures of PostScript, Arabic, and other typographically oriented sets, and the circled digits, half-width katakana and double-width Latin letters which had their own codes in standard Japanese charsets.
All these features meant that Unicode would allow multiple encodings for identical or very similar characters, to a much greater degree than any previous standard, thus negating the main advantage of a standard and making text search a nightmare. Hence the need for the standard normal forms. Canonical equivalence was introduced to cope with the first two sources of ambiguity above, while compatibility equivalence was meant to address the last one. Jorge Stolfi (talk) 14:49, 16 June 2011 (UTC)[reply]
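
To make the two kinds of equivalence concrete, here is a minimal Python sketch using the standard library's unicodedata module. The OHM SIGN stands in here for the duplicated characters mentioned above (like the ANGSTROM SIGN, it has a singleton canonical decomposition), and the fullwidth/halfwidth forms are the Japanese-charset typographic variants mentioned above.

    import unicodedata

    # Canonical duplicate inherited from a symbols set: OHM SIGN vs.
    # GREEK CAPITAL LETTER OMEGA.
    print(unicodedata.normalize("NFC", "\u2126") == "\u03A9")   # True

    # Typographic variants kept for round-tripping Japanese charsets:
    # fullwidth 'A' and halfwidth katakana 'ka' survive the canonical
    # forms but fold under the compatibility forms.
    print(unicodedata.normalize("NFC", "\uFF21"))    # 'Ａ' (unchanged)
    print(unicodedata.normalize("NFKC", "\uFF21"))   # 'A'
    print(unicodedata.normalize("NFKC", "\uFF76"))   # 'カ' (U+30AB)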

I agree it would be nice to find a source that gives the exact reasons. There are better quotes in some other Unicode articles on Wikipedia. However, except for the precomposed characters, all your reasons amount to "an existing character set had N ways of encoding this character, and thus Unicode needed N ways".
Precomposed characters were certainly mostly driven by the need to make it easy to convert existing encodings, and to make it easy to render readable output from most Unicode text. There may have been existing character sets with both precomposed characters and combining diacritics; if so, this would fall under the first explanation. But I doubt that would have led to the vast number of combined characters in Unicode. Spitzak (talk) 18:58, 16 June 2011 (UTC)[reply]


So Unicode equivalence is necessary. The question you want to answer is: why were NFD and NFC introduced? 86.75.160.141 (talk) 21:31, 24 October 2012 (UTC)[reply]
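
As a rough illustration of what the two forms do, here is a sketch using the standard Python unicodedata module; the umlaut-n is the "rock band" example mentioned above, chosen because no precomposed code point exists for it.

    import unicodedata

    # No precomposed code point exists for n-with-diaeresis, so it
    # stays as two code points even under NFC; combinations that do
    # have a precomposed form compose under NFC and split under NFD.
    n_umlaut = "n\u0308"
    print([hex(ord(c)) for c in unicodedata.normalize("NFC", n_umlaut)])
    # ['0x6e', '0x308'] -- still base letter + combining mark

    e_acute = "e\u0301"
    print([hex(ord(c)) for c in unicodedata.normalize("NFC", e_acute)])
    # ['0xe9']          -- composed to U+00E9
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\u00E9")])
    # ['0x65', '0x301'] -- decomposed again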

Usage


This article says nothing about the usage of Unicode equivalence. This means some text is missing.

I do not know of much software that relies on or supports Unicode equivalence, but there is at least one example: Wikipedia.

Unicode equivalence is recognized by the Wikipedia software in a way that allows users of both NFD and NFC systems to access the same article, despite the internal normalization-form difference.

Some people might need a reference to prove this, but I do not bring any reference, only this demonstration:

For instance, those two pages are a single article:

The same occurs with Cancún: Cancún and Cancún (although the colors might differ for some obscure and not obvious reason):

I suggest using this information to improve the article, without turning the article into a «how to use Wikipedia» guide. 86.75.160.141 (talk) 19:14, 11 October 2012 (UTC)[reply]
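
The effect described above can be sketched in Python with the standard unicodedata module. As far as I know, MediaWiki normalizes page titles to NFC, which is why the NFC and NFD spellings of "Cancún" reach the same page:

    import unicodedata

    composed = "Canc\u00FAn"                              # ú as one code point (NFC)
    decomposed = unicodedata.normalize("NFD", composed)   # u + combining acute (NFD)

    print(composed == decomposed)    # False: different code point sequences
    # Normalizing both to a single form makes them match:
    print(unicodedata.normalize("NFC", composed) ==
          unicodedata.normalize("NFC", decomposed))       # True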

I found this article looking for why 𝓌𝒾𝓀𝒾𝓅𝓮𝒹𝒾𝒶.org seemed to land me on wikipedia.org; it might be a good illustrative example. 155.94.127.118 (talk) 23:36, 4 September 2019 (UTC)[reply]
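
That folding is compatibility equivalence at work. A quick Python check (note that real domain-name handling goes through the IDNA/UTS #46 mapping rather than plain NFKC, but NFKC shows the same effect for these characters):

    import unicodedata

    fancy = "\U0001D4CC\U0001D4BE\U0001D4C0\U0001D4BE"     # 𝓌𝒾𝓀𝒾, mathematical script letters
    print(unicodedata.normalize("NFC", fancy) == "wiki")   # False: not canonically equivalent
    print(unicodedata.normalize("NFKC", fancy))            # 'wiki': compatibility equivalent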

Well-formedness


"Well-formedness" refers to whether the sequences of 8-bit, 16-bit or 32-bit storage units properly define a sequence of characters (technically, 'scalar values'). Having combining characters without base characters makes a string 'defective'. There are other faults in a well-formed string that have no name, such as broken Hangul syllable blocks, characters in the wrong order (not all scripts have been set up so that canonical equivalence will 'eliminate' ambiguities), and variation selectors in the wrong places. RichardW57 (talk) 00:49, 17 June 2014 (UTC)[reply]

External links modified

Hello fellow Wikipedians,

I have just added archive links to one external link on Unicode equivalence. Please take a moment to review my edit. If necessary, add {{cbignore}} after the link to keep me from modifying it. Alternatively, you can add {{nobots|deny=InternetArchiveBot}} to keep me off the page altogether. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true to let others know.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—cyberbot II Talk to my owner: Online 18:31, 18 January 2016 (UTC)[reply]

Canonicality


Currently the article states (under Combining and precomposed characters) that "In general, precomposed characters are defined to be canonically equivalent to the sequence of their base letter and subsequent combining diacritic marks, in whatever order these may occur", but also (under Canonical ordering) that the canonical decomposition (of U+1EBF) U+0065 U+0302 U+0301 "is not equivalent with U+0065 U+0301 U+0302". Either this is a contradiction and should be fixed, or some further clarification would help.
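
For what it's worth, the two statements are reconcilable: canonical reordering only swaps marks of different canonical combining classes, and U+0302 and U+0301 both have class 230 (marks above), so their relative order is significant and is preserved. A Python check with the standard unicodedata module:

    import unicodedata

    # Circumflex and acute both attach above (combining class 230), so
    # canonical ordering must not swap them: the order encodes stacking.
    print(unicodedata.combining("\u0302"), unicodedata.combining("\u0301"))  # 230 230
    print(unicodedata.normalize("NFD", "e\u0302\u0301") ==
          unicodedata.normalize("NFD", "e\u0301\u0302"))    # False: not equivalent

    # Marks of different classes do not interact and are reordered,
    # so these two spellings are canonically equivalent:
    print(unicodedata.combining("\u0323"))                  # 220 (mark below)
    print(unicodedata.normalize("NFD", "e\u0301\u0323") ==
          unicodedata.normalize("NFD", "e\u0323\u0301"))    # True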

German Article


There seems to be a German version, which is not linked: https://de.wikipedia.org/wiki/Normalisierung_(Unicode) Skillabstinenz (talk) 20:22, 23 June 2020 (UTC)[reply]

Naming

Why do we have this article named «Unicode equivalence» instead of «Unicode normalization forms»? It's confusing.

AXONOV (talk) 10:43, 21 January 2022 (UTC)[reply]

A with ring above example


The text has:

For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å")

But the two "A with ring above" characters appear to be the same Unicode character. This defeats the effectiveness of the example, for those who care to inspect the characters carefully (e.g. by copy-and-pasting into some Unicode inspector tool). I would have to look into the article history in detail to see when these two characters were made the same. Cmcqueen1975 (talk) 06:53, 29 October 2024 (UTC)[reply]

I didn't write the original text but have rewritten the section to clarify. Does that respond to your concern?
I think the intention was to convey that it doesn't matter which of the two codepoints you use, since they are canonically equivalent. (Which is a face-saving way of admitting that someone boobed way back. A codepoint for angstrom sign should never have been created but what's done is done.) 𝕁𝕄𝔽 (talk) 08:01, 29 October 2024 (UTC)[reply]
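
The observation above can be reproduced in Python with the standard unicodedata module: the two code points are distinct, but U+212B has a singleton canonical decomposition to U+00C5, so any tool in the copy-and-paste chain that normalizes text will silently replace it, which would explain why the two characters in the article now appear identical.

    import unicodedata

    angstrom = "\u212B"   # ANGSTROM SIGN
    a_ring = "\u00C5"     # LATIN CAPITAL LETTER A WITH RING ABOVE

    print(angstrom == a_ring)                                 # False: distinct code points
    print(unicodedata.normalize("NFC", angstrom) == a_ring)   # True: canonically equivalent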