Talk:Specials (Unicode block)

From Wikipedia, the free encyclopedia

Replacement Characters

Much of the text around replacement characters is a very one-sided opinion piece. Wikipedia shouldn't be the place to try to indoctrinate programmers of text editors to one's point of view. And the suggestion "just save the file as it was" completely ignores the problems that arise when users change that part of the file, or copy and paste it from one point to another. If the editor is editing in UTF-8, then it should presume the user has produced UTF-8 text, even if the editor had presented mojibake to begin with.

However, that is all beside the point as "how do replacement characters get introduced" is only a small part of "what is a replacement character".

The suggestion to pretend that the underlying data was Windows-1252 is a further US-centric viewpoint and doesn't help if you already have a U+FFFD in your data. -- Preceding unsigned comment added by Krestadroxefotk (talk • contribs) 18:28, 19 October 2022 (UTC)

Accurate: The replacement character � (often displayed as a black rhombus with a white question mark) is a symbol found in the Unicode standard at code point U+FFFD in the Specials table. It is used to indicate problems when a system is unable to render a stream of data to a correct symbol.[4] It is usually seen when the data is invalid and does not match any character:

A decent example, but this is not the only cause. Consider a text file containing the German word für (meaning 'for') in the ISO-8859-1 encoding (0x66 0xFC 0x72). This file is now opened with a text editor that assumes the input is UTF-8. The first and last byte are valid UTF-8 encodings of ASCII, but the middle byte (0xFC) is not a valid byte in UTF-8. Therefore, a text editor could replace this byte with the replacement character symbol to produce a valid string of Unicode code points. The whole string now displays like this: "f�r".
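The für case in the quoted passage can be reproduced in a few lines. This is only a minimal sketch using Python's built-in UTF-8 decoder; the variable names are illustrative, and other decoders may group invalid bytes differently when substituting U+FFFD.

```python
# Bytes of "für" in ISO-8859-1, as given in the quoted passage.
latin1_bytes = bytes([0x66, 0xFC, 0x72])

# Strict UTF-8 decoding rejects the lone 0xFC byte.
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" substitutes U+FFFD for the invalid byte instead of failing.
replaced = latin1_bytes.decode("utf-8", errors="replace")

print(strict_ok)        # False
print(repr(replaced))   # 'f\ufffdr'
```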

Additional possibilities: Taking a substring at the wrong offset can produce ill-formed UTF-8 or UTF-16, which may then be turned into replacement characters. For example, if a UTF-8 encoded für occurs partway through a sentence and an application takes only the first 16 bytes to do some work, the cut can fall between the two bytes of the UTF-8 encoded ü, leaving an incomplete sequence.
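To make the truncation case concrete, here is a rough sketch in Python. The sample sentence and the helper `truncate_utf8` are made up for illustration; the sentence is chosen so that the first byte of ü is byte 16. The helper assumes its input is valid UTF-8.

```python
# Hypothetical sample text: 15 ASCII bytes, then "ü" as the bytes 0xC3 0xBC.
data = "Das ist etwas für dich".encode("utf-8")

# A naive 16-byte cut keeps 0xC3 but drops 0xBC.
cut = data[:16]
shown = cut.decode("utf-8", errors="replace")  # ends in U+FFFD


def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut to at most `limit` bytes without splitting a multi-byte sequence."""
    if limit >= len(data):
        return data
    end = limit
    # Continuation bytes look like 10xxxxxx. If the first dropped byte is one,
    # step back until the cut lands on the start of a character.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


safe = truncate_utf8(data, 16)

print(repr(shown))  # 'Das ist etwas f\ufffd'
print(safe)         # b'Das ist etwas f'
```

Backing up to a character boundary avoids the ill-formed tail, at the cost of returning slightly fewer bytes than requested.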

Another possibility is data corruption during transmission. A changed bit or a truncated packet can leave illegally encoded Unicode, which may be changed to U+FFFD to indicate the error.

Irrelevant speculation. This does not consider what happens if a user "fixes" some of the cases and/or moves text around, or many other considerations. Garbage in, garbage out. The badly implemented text editor should be corrected by using the correct encoding when reading the file. A poorly implemented text editor might save the replacement in UTF-8 form; the text file data will then look like this: 0x66 0xEF 0xBF 0xBD 0x72, which will be displayed in ISO-8859-1 as "fï¿½r" (this is called mojibake). Since the replacement is the same for all errors, this makes it impossible to recover the original character. A better (but harder to implement) design is to preserve the original bytes, including the error, and only convert to the replacement when displaying the text. This will allow the text editor to save the original byte sequence, while still showing the error indicator to the user.
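For reference, both behaviours described in the quoted passage can be sketched in Python. This is an illustration under assumptions, not a description of any particular editor: it uses Python's `surrogateescape` error handler as one possible way to "preserve the original bytes", and `for_display` is a hypothetical helper. It does not address the objection above about what should happen once the user edits or moves the affected text.

```python
original = bytes([0x66, 0xFC, 0x72])  # "für" in ISO-8859-1

# Lossy path: decode with replacement, then save as UTF-8.
lossy_saved = original.decode("utf-8", errors="replace").encode("utf-8")
# Viewing the saved bytes as ISO-8859-1 gives the mojibake from the passage.
mojibake = lossy_saved.decode("iso-8859-1")

# Preserving path: surrogateescape maps each undecodable byte to a lone
# surrogate (0xFC becomes U+DCFC), which encodes back to the same byte.
preserved = original.decode("utf-8", errors="surrogateescape")
round_tripped = preserved.encode("utf-8", errors="surrogateescape")


def for_display(text: str) -> str:
    """Show U+FFFD in place of escaped bytes; used for display only."""
    return "".join(
        "\ufffd" if 0xDC80 <= ord(ch) <= 0xDCFF else ch for ch in text
    )


print(lossy_saved.hex(" "))          # 66 ef bf bd 72
print(mojibake)                      # fï¿½r
print(round_tripped == original)     # True
print(repr(for_display(preserved)))  # 'f\ufffdr'
```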

Reasonable: At one time the replacement character was often used when there was no glyph available in a font for that character. However, most modern text rendering systems instead use a font's .notdef character, which in most cases is an empty box (or "?" or "X" in a box[5]), sometimes called a "tofu" (this browser displays 􏿾). There is no Unicode code point for this symbol.

The first sentence of the following is true. The remainder is speculation and considers only one possible way errors can be introduced. Not only is the suggested workaround non-global, but it completely ignores many of the other possible causes of a corrupted data stream, which can aggravate, not improve, the problem. In some contexts this may be interesting, but this broad generalization is not valid and shouldn't be encouraged so emphatically on Wikipedia.

Thus the replacement character is now only seen for encoding errors, such as invalid UTF-8. Some software attempts to hide this by translating the bytes of invalid UTF-8 to matching characters in Windows-1252 (since that is the most likely source of these errors), so that the replacement character is never seen. -- Preceding unsigned comment added by Krestadroxefotk (talk • contribs) 18:39, 19 October 2022 (UTC)
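For anyone weighing the Windows-1252 fallback, here is a rough Python sketch of what such software might do, along with two of its limits. The handler name `cp1252-fallback` and the function `decode_lenient` are invented for this example; real implementations may differ.

```python
import codecs


def _cp1252_fallback(err):
    """Error handler: reinterpret undecodable bytes as Windows-1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # A few byte values (such as 0x81) are unassigned in Python's cp1252
    # codec; those still come out as U+FFFD.
    return bad.decode("cp1252", errors="replace"), err.end


# Hypothetical handler name, registered for this sketch only.
codecs.register_error("cp1252-fallback", _cp1252_fallback)


def decode_lenient(data: bytes) -> str:
    """Decode as UTF-8, falling back to Windows-1252 for invalid bytes."""
    return data.decode("utf-8", errors="cp1252-fallback")


print(decode_lenient(bytes([0x66, 0xFC, 0x72])))     # für
print(decode_lenient("für".encode("utf-8")))         # für
print(repr(decode_lenient(b"f\xef\xbf\xbdr")))       # 'f\ufffdr'
print(repr(decode_lenient(b"\x81")))                 # '\ufffd'
```

The third call shows the point raised earlier in this thread: a U+FFFD that was already saved into the data is valid UTF-8, so the fallback never sees it and cannot recover the original byte.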

Rhombus

Why is the replacement character described as a black rhombus with a white question mark when it's a square? Granted, all squares are rhombuses … But then, why not call it a rectangle? Or a quadrilateral? Or even a polygon? Aeron --2A01:CB1D:2E8:D000:223B:609C:AD60:6D0 (talk) 09:06, 10 June 2022 (UTC)

I'm guessing it's because most fonts (at least on my Windows 11 laptop) use a rhombus for U+FFFD instead of a square. The Unicode chart does use a rotated square, but as noted at the top of the chart, "The shapes of the reference glyphs used in these code charts are not prescriptive. Considerable variation is to be expected in actual fonts." DRMcCreedy (talk) 15:18, 10 June 2022 (UTC)

Assignment detail

The "Unicode version history" section of the infobox details how many characters were assigned in each release. I think reiterating that in the text is pointless. I propose deleting the "Of these 16 code points, five have been assigned since Unicode 3.0" sentence entirely. If that's not acceptable, we need to clarify the wording of this redundant detail. I oppose "as of Unicode 3.0" because it makes it sound like the article is out of date. Likewise, I oppose the latest wording of "since Unicode 3.0" because no code points have been assigned since version 3.0 was released. "Of these 16 code points, five have been assigned" is accurate but again redundant, as the infobox gives more detailed information. My preference is deleting that sentence. DRMcCreedy (talk) 22:08, 21 October 2024 (UTC)

Yes, delete it. Spitzak (talk) 01:15, 22 October 2024 (UTC)
Done. Thanks. DRMcCreedy (talk) 02:13, 22 October 2024 (UTC)