Talk:UTF-32
This article is rated Start-class on Wikipedia's content assessment scale.
Doesn't explain the details of the encoding
Yes, it's a Unicode encoding. Yes, if you want to know the details you can go read the standards. But this encoding ought to be simple enough to explain on this page. Specifically, I came here to see whether the code points should be encoded little-endian or big-endian, i.e. whether the least significant byte or the most significant byte goes first. Another aspect of UTF-32 is the possibility of a byte order mark indicating whether the following data uses little-endian or big-endian order.
Unicode currently defines three 32-bit encoding schemes: UTF-32, UTF-32BE, and UTF-32LE. This page does not mention them or how they differ.
Unicode standards have evolved since the old confusing days. The language, definitions, and FAQs have done a good job of clearing up a lot of confusion and I highly recommend checking that out for anyone who is interested. Good luck Wikipedia! 2600:1700:BA69:10:C50D:52CC:4048:FCA7 (talk) 06:16, 7 July 2023 (UTC)
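For readers following the byte-order question above, here is a minimal sketch of how the three schemes differ for the single character 'A' (U+0041). It assumes Python's built-in codecs named "utf-32", "utf-32-be", and "utf-32-le"; the variable names are only illustrative.

```python
import codecs

text = "A"  # U+0041

big = text.encode("utf-32-be")     # no BOM, most significant byte first
little = text.encode("utf-32-le")  # no BOM, least significant byte first
with_bom = text.encode("utf-32")   # BOM first, then the platform's byte order

print(big.hex(" "))
print(little.hex(" "))

# The BOM is U+FEFF encoded in the same byte order as the data that follows,
# so a decoder can tell the two orders apart from the first four bytes.
print(with_bom[:4] in (codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE))
print(with_bom.decode("utf-32"))
```

Whether a given file carries a BOM at all depends on which of the three scheme labels it was written under, so this only illustrates the byte layouts, not what any particular producer emits.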
This is not an encoding
This is not an encoding, but simply Unicode itself. If a character's Unicode code is 42, then the 32 bit integer which holds 42 is not "UTF-32". It's just the code of that character. UTF-32 is merely a synonym for Unicode. UTF-32('A') is 65, and Unicode('A') is 65. Same thing.
When we have a string of ASCII characters encoded as 8 bit values, do we call that ATF-1? No, it is just an ASCII string.
Another problem is that if this is an encoding, what is the byte order? Is the character 'A' stored as 00 00 00 41 or is it stored as 41 00 00 00? If this is an encoding, we should be able to answer such a basic question. Encoding means settling all issues of bitwise representation for the purposes of transport.
24.85.131.247 (talk) 22:49, 29 October 2011 (UTC)
- No, it's an encoding (big-endian by default on networks[1]); your example is just a hypothetical encoding. "When we have a string of ASCII characters encoded as 8 bit values, do we call that ATF-1?" You might; it would be an 8-bit encoding of ASCII. Note that ASCII characters aren't defined as 8-bit; they are seven-bit. Everything in computers is an encoding: a coding of something, say colors or letters (or even integers), that has no intrinsic "number" or bit pattern. comp.arch (talk) 22:26, 24 September 2016 (UTC)
IPsock of SlitherioFan2016.
|
---|
0 -> 0x00000000
This notation seems more important, as it suggests that the number is 32 bit. 108.66.232.241 (talk) 20:57, 5 November 2016 (UTC) |
Why 4 byte? Why not 3?
Why is there no 3-byte encoding? 2^24 is 16,777,216, much more than is needed to represent the 1,114,112 code points of Unicode. Is this because of word boundaries? I understand there are tradeoffs, but wouldn't someone somewhere have a use for a simple-to-process encoding that didn't waste a whole byte for each character? --Apantomimehorse 06:47, 9 September 2006 (UTC)
- Truth is, if they had planned things properly from the beginning I doubt this encoding would exist. If you are going to the trouble of supporting supplementary characters you will probably want other advanced text features too, which will nullify most of the advantages of a fixed-width encoding.
- In any case you'd be pretty mad to use UTF-32 or UTF-24 for storage or transfer purposes, and if you want to use a 3-byte encoding internally in your app or app framework there's nothing to stop you (though I strongly suspect it will perform far worse than either a well-written UTF-16 or UTF-32 system). Plugwash 00:39, 11 September 2006 (UTC)
- The reason for a 4-byte and not a 3-byte encoding is simple: a 32-bit number is a native unit for today's dominant 32-bit and 64-bit processors. Reading memory in units of 24 bits would be much more expensive than reading 32-bit chunks. -- Sverdrup (talk) 18:58, 8 November 2011 (UTC)
- It's because there is a private use area 0x60000000 to 0x7FFFFFFF. 108.71.121.98 (talk) 18:35, 12 September 2016 (UTC)
- Incorrect, there is no such private use area. You may be thinking of 0x0f0000 – 0x10ffff, which are the last two planes. The reason for 4 bytes is that most machines can address a 4-byte unit of memory as an integer, but not a 3-byte unit. Spitzak (talk) 18:44, 12 September 2016 (UTC)
- IBM S/370 through the current z/Architecture have ICM and STCM, which allow load and store of, among others, three-byte data. Presumably faster than the masking that would otherwise be required. Note that S/370 has 24 bit addresses, normally stored in 32 bit words, but possibly with other data in the high byte. It would be pretty easy to index an array of three byte data, though. Gah4 (talk) 23:54, 9 June 2017 (UTC)
- No, I'm thinking of the one outside Unicode, which is 0x60000000 to 0x7FFFFFFF. There is also one that is 0x00E00000 to 0x00FFFFFF. 108.71.121.98 (talk) 21:58, 12 September 2016 (UTC)
- The last 32,768 planes are unassigned. 108.66.233.59 (talk) 17:56, 5 October 2016 (UTC)
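To make the arithmetic in this thread concrete: every Unicode code point fits in 24 bits, so a fixed-width 3-byte packing is possible in principle. The sketch below is purely hypothetical; "UTF-24" is not a standard encoding, and `pack24` and `unpack24` are names invented here for illustration.

```python
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point


def pack24(text: str) -> bytes:
    """Pack each code point into exactly 3 big-endian bytes (hypothetical format)."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)


def unpack24(data: bytes) -> str:
    """Inverse of pack24; expects a length that is a multiple of 3."""
    if len(data) % 3 != 0:
        raise ValueError("length must be a multiple of 3")
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "big")) for i in range(0, len(data), 3)
    )


sample = "A\u20ac\U00010000"  # one ASCII, one BMP, one supplementary character
packed = pack24(sample)
print(len(packed), unpack24(packed) == sample)
print(MAX_CODE_POINT + 1, 2 ** 24)
```

The packing saves one byte per code point compared with UTF-32, at the cost of elements that most processors cannot load as a single aligned integer, which is the tradeoff the replies above describe.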
NPOV?
Is it just the way I'm reading this article, or does it stink of a total lack of NPOV? It almost reads like a case for everybody forgetting about UTF-32. UTF-32 space inefficient? Not if you're Japanese. The whole reason character handling is in the state it's in is that people didn't care about the needs of other people. It was pretty clear a long time ago that a solution to i18n was needed, and that it would have to be something not unlike UTF-32.
"Also whilst a fixed number of bytes per code point may seem convenient at first it isn't really that much use. It makes truncation slightly easier but not significantly so compared to UTF-8 and UTF-16. It does not make calculating the displayed width of a string any easier except in very limited cases since even with a “fixed width” font there may be more than one code point per character position (combining marks) or indeed more than one character position per code point (for example CJK ideographs). Combining marks also mean editors cannot treat one code point as being the same as one unit for editing."
Well, no. If you're talking about drawing glyphs, sure, but it has absolutely no pros or cons compared to other charsets in that context. It makes i18n string handling easier by an order of magnitude though. Put simply, all you do is divide everything by four. Try counting the length of a string in UTF-8 or UTF-16. It's just about impossible to do in a stable way. Look at the whole "Bush hid the facts" bug in Notepad: the *perfect* example of an issue that would never have occurred with UTF-32. http://www.evilshroud.com/bushhidthefacts/ --Streaky 03:35, 30 November 2006 (UTC)
- Inefficient is definitely true; in the best case it's no better than either UTF-8 or UTF-16, and in the common cases (yes, that includes Chinese and Japanese) it is far worse.
- What *IS* the code point count useful for? Most of the time what matters is either size in memory, grapheme cluster count or console position count.
- As for the <name> hid the facts "bug" you mentioned, it doesn't look like a charset issue to me (and is almost certainly not related to either UTF-8 or UTF-16). To me it looks like a deliberate easter egg but unless someone can translate Plugwash 12:52, 30 November 2006 (UTC)
- The "bush hid the facts" issue is a technical side effect, neither a bug nor an easter egg. Since a text file doesn't carry information about its encoding, you have to guess. Especially for short strings it sometimes comes out wrong, treating a text with encoding X like one with encoding Y. More details. --193.99.145.162 17:21, 27 June 2007 (UTC)
- "Bush hid the facts" is in fact *caused* by the use of a non-byte encoding (UCS-2), rather than an ASCII-compatible encoding such as UTF-8. Use of UTF-32 would result in similar bugs. So in fact it is an argument *against* using UTF-32. Spitzak (talk) 22:18, 21 April 2010 (UTC)
- Not sure what the article means by "more than one character position per code point (for example CJK ideographs)"; won't these be one (CJK) character per code point as well? Regarding whether this article is NPOV, since most commonly used CJK characters are in the BMP, which can be represented with only 2 bytes, always using 4 bytes to represent these is wasteful even if you are Japanese. Raphanid 22:59, 29 June 2007 (UTC)
- Raphanid: In traditional fixed width CJK fonts an ideograph is 2 character positions wide (that is, twice the width of a Latin alphabet letter).
- 193.99.145.162: Do you have a source for the hid the facts thing being a misdetection (that MSDN page isn't one). Given the length and pure English nature of the message it seems pretty unlikely. Plugwash 17:42, 30 June 2007 (UTC)
UTF-32 not used?
"For these reasons UTF-32 is little used in practice with UTF-8 and UTF-16 being the normal ways of encoding Unicode text"
I disagree with this statement. wchar_t in Unix/Linux C applications is in UTF-32 format. This makes it pretty often used. Vid512 17:28, 19 March 2007 (UTC)
- Also, UTF-32 is used as the internal format for strings in the Python programming language, in the C-based reference implementation at any rate. (Actually it uses UCS-4, as Python does not prohibit lone high or low surrogates from being encoded.) As there are no referenced facts to support this statement and there are clear uses of UTF-32/UCS-4 in systems today, I'm removing the claim that it's not used. - Dmeranda 04:44, 5 August 2007 (UTC)
- I noticed this and also disagreed. But when I read that Dmeranda removed it, well, it was still there! Checking, it appears he removed another statement instead. I've restored that and removed the above claim. mdf (talk) 15:38, 27 November 2007 (UTC)
- CPython can be compiled to use either UCS-4 or UCS-2 internally. It defaults to UCS-2, but many Linux distributions compile it to use UCS-4. Agthorr (talk) 15:44, 9 May 2010 (UTC)
Removing cleanup and dubious tags
Well first, the edit comment is a bit inaccurate. I quickly looked through the history and I thought I saw the cleanup message there but upon further examination it seems I was wrong. So it should instead say that the cleanup tag is wrong in that it was most certainly not there since September 2007, and the article is in fine shape (though not the best it could be), so it shouldn't be there.
As for the dubious tags, the claim that it's more space efficient is well justified by the following sentence, which notes that non-BMP characters are rare. This is by design; the BMP is intended to contain pretty much every character in major (and most minor) modern languages, as the standard notes [2]. The BMP takes 2 bytes in UTF-16 and 1 to 3 in UTF-8, so for text consisting of BMP characters, UTF-32 obviously takes more space. For a real world example, with a file consisting of large amounts of Japanese and ASCII text (all in the BMP), it is 10MB with UTF-8, 14MB with UTF-16, and 28MB with UTF-32.
For the claim that it's rarely used, Unixy systems use UTF-8, Windows uses UTF-16, various programming languages mostly use either (though I know Python can use UCS-4 if you compile it so). Another message on this page talks about wchar_t, but that's implementation-specific, and the Unicode standard even advises against it for code that's supposed to be portable for this reason [3]. In my experience it doesn't seem to be used nearly as much as others, though I admit my experience in this area isn't quite vast. Regardless, a completely implementation-specific data type in a single language hardly changes matters.
For those reasons I've removed those tags. The article might do with a few citations, but there's no dubious information in it, and though it could be improved, it's written well enough that it does not require a cleanup. 24.76.174.152 (talk) 07:15, 19 November 2009 (UTC)
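The space comparison in this section can be reproduced on any text. The following is a rough sketch using Python's built-in codecs; the sample string and the helper name `encoded_sizes` are made up for illustration and are not the file measured above.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes for each encoding form (no BOM)."""
    return {
        codec: len(text.encode(codec))
        for codec in ("utf-8", "utf-16-le", "utf-32-le")
    }


# Illustrative mixed Japanese and ASCII sample; every character is in the BMP.
sample = "日本語のテキスト and some ASCII"
sizes = encoded_sizes(sample)
for codec, size in sizes.items():
    print(f"{codec}: {size} bytes for {len(sample)} code points")
```

For BMP-only text, UTF-32 always takes exactly twice the space of UTF-16, while the UTF-8 size depends on the ratio of ASCII to other characters.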
Character vs. code point
The History section reads: "UCS-4, in which each encoded character in the Universal Character Set (UCS) is represented by a 32-bit friendly code value in the code space of integers between 0 and hexadecimal 7FFFFFFF." Shouldn't this be "each encoded code point"? A 32-bit value doesn't necessarily represent one character; some characters are composed of several values. Tigrisek (talk) 19:45, 16 January 2011 (UTC)
- I agree, this fix was already done for the UTF-8/16 pages. Spitzak (talk) 03:41, 18 January 2011 (UTC)
NPOV
I'm a seasoned software developer, and I believe that it's damn convenient that the Nth character of a string can be found by indexing to position [N-1] in an array (or [N], if 1-based). 24.85.131.247 (talk) 22:18, 29 October 2011 (UTC)
- As a "seasoned software developer" I would be interested in you locating actual code you have written where it looked at character N in a string without first looking at characters 0..N-1. Using the return value from another function that looked at characters 0..N-1 does not count. The N must be generated without ever looking at the string. Any other use can be rewritten to use code unit offsets or iterator objects and is not an argument for fixed-sized code points. Spitzak (talk) 22:20, 31 October 2011 (UTC)
- For instance: Take a look at the -c switch in the "cut (Unix)" article. — Preceding unsigned comment added by 62.159.14.9 (talk) 10:14, 8 February 2016 (UTC)
- Sorry, wrong. The cut command reads UTF-8, and therefore has scanned all the "characters" before n and can count while doing so. And in fact unless the writers are complete idiots, this is how it would be written. Note the -n switch ("don't split multibyte characters"); this is a good indication that cut does not convert to UTF-32 at any point. I would also like to see an actual script that uses the -c switch and would fail to produce the desired result if -b and -n were used instead. Spitzak (talk) 02:03, 9 February 2016 (UTC)
- Boyer-Moore indexes through strings without looking at all characters in between. In the best case, it can search through a string looking at every Nth character. Boyer-Moore is commonly used by grep and related search programs. Gah4 (talk) 18:41, 29 September 2016 (UTC)
- Boyer-Moore can be done using code units, not code points, so it does not count. If you want to match a 3-byte UTF-8 character, it is the same as matching three 1-byte characters. Spitzak (talk) 00:28, 30 September 2016 (UTC)
- But how do you know where the code points are? The whole idea of B-M is that you don't have to look at many of the characters, but that only works if you know how many there are. Gah4 (talk) 03:03, 30 September 2016 (UTC)
- Think real hard now: how does it handle the substring "ABC" in the pattern? Now figure out how to reuse that for a "character" that is three bytes in utf8. You can do it.
- Spitzak asked for an algorithm that looked at character N without looking at characters 0..N-1. Given an N+1 character query, B-M does that. In the best case, it looks at exactly every Nth character. As asked, N is generated without looking at the string being searched. As for actual use, B-M always depends on the statistics of the strings. I suspect that UTF-8 strings with 2 or 3 byte characters will be statistically worse than those with only 1 byte characters, but that wasn't asked. Gah4 (talk) 08:07, 30 September 2016 (UTC)
- An N-code-point string can be converted to an M-byte UTF-8 string. You then apply the search algorithm to the M bytes. There is still no need for fixed-size code points.
- This is true, but it answers a different question. It may, in fact, often be the better solution. Note, though, that many systems are slow when indexing bytes, but fast when indexing 32-bit words. If one is doing a large enough search, it might be that the 32-bit version is faster. Also, B-M works on the statistical properties of characters in strings. The statistics of UTF-8 strings might be very different from UTF-32 strings. Considering WP:NPOV, we shouldn't make assumptions on what someone might do, but just supply the facts. Gah4 (talk) 05:13, 9 June 2017 (UTC)
- Optimized byte-based pattern matching can and does read/write larger units than bytes, such as 64-bit units on modern machines. Changing all the code points to 32 bits actually hurts, as the search will then only process 2 characters at a time, rather than up to 8 characters at a time. Spitzak (talk) 19:04, 9 June 2017 (UTC)
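The point argued in this exchange, that a substring search can run over UTF-8 code units rather than code points, can be sketched as follows. This is only an illustration: it relies on Python's `bytes.find` (whatever search algorithm the runtime uses internally, not necessarily Boyer-Moore), and the helper name `find_code_units` and the sample strings are assumptions made here.

```python
def find_code_units(text: str, pattern: str) -> int:
    """Search the UTF-8 bytes directly; return a byte offset, or -1 if absent."""
    haystack = text.encode("utf-8")
    needle = pattern.encode("utf-8")
    return haystack.find(needle)


text = "naïve café, 日本語のテキスト"
pattern = "日本語"

byte_offset = find_code_units(text, pattern)

# UTF-8 is self-synchronizing: a match for a validly encoded pattern can only
# begin at a character boundary, so the prefix before it decodes cleanly.
prefix = text.encode("utf-8")[:byte_offset].decode("utf-8")
print(byte_offset, len(prefix), text.index(pattern))
```

The byte offset and the code point index differ whenever multi-byte characters precede the match, which is why the two sides above disagree about which kind of offset an API should expose.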
- Don't know about that guy, but I am currently writing a soft real-time appliance (running on BareMetalOS) in assembly. Fixed width encoding is very useful to me, even though I too don't necessarily consider myself a strictly "novice" programmer as mentioned in the article. This does have real uses. Sometimes space just needs to be bounded but can be arbitrary in principle, as long as you can do your stuff in a low fixed number of processor cycles. There are real use cases for this stuff, even for people who are not novices (not that I'm an expert on anything either). That's why it finds use. That wording in the article just seems unnecessary. If enough people agree, I would be personally willing to come up with a complete rewrite of most sections to discuss and see if it might improve the article. Does Wikipedia have a way of proposing large rewrites on the talk page without actually changing the article immediately? Like a pull request? --79.230.175.7 (talk) 19:42, 28 May 2016 (UTC)
- There is no reason to "bound" space to a certain number of Unicode code points. You could instead bound it to a certain number of code units, thus fitting in more 1-byte UTF-8 characters than 4-byte ones. It is trivial to find the start of a character if your bound is in the middle of it. So limiting things to fit in a buffer is not a reason to use UTF-32. Spitzak (talk) 00:30, 30 September 2016 (UTC)
- This is just wrong. You may not have noticed when the guy said BareMetalOS. BareMetalOS does everything, at the application level, in terms of finite memory allocations whose sizes are predetermined. The reason to 'bound' space to a certain number of codepoints is because his dang code is going to segfault the instant anybody gets careless and goes over. 173.228.13.5 (talk) 18:32, 27 May 2022 (UTC)
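For reference, the "bound by code units and back up to the start of a character" approach mentioned in this subthread can be sketched like this. It assumes the input is valid UTF-8; `truncate_utf8` is a name invented here, and the sketch says nothing about whether this fits the fixed-allocation constraints described in the reply.

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut valid UTF-8 to at most `limit` bytes without splitting a character."""
    if len(data) <= limit:
        return data
    end = limit
    # Continuation bytes look like 10xxxxxx. If the first byte that would be
    # dropped is one of them, the cut falls inside a character: back up.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


encoded = "héllo".encode("utf-8")  # 'é' occupies two bytes
for limit in range(len(encoded) + 1):
    print(limit, truncate_utf8(encoded, limit).decode("utf-8"))
```

The loop backs up at most three bytes, so the cost is bounded regardless of the string length.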
- Yes, look up "sandbox". It used to be allowed in the "article space", but no longer, so you need to have it under your own talk page. You can copy the article to there and change as you wish, and point us to it. comp.arch (talk) 22:04, 24 September 2016 (UTC)
- Typically you do not want to deal with all the encoding stuff yourself but use a library instead. As a library developer, I'm pretty sure my users will not expect a linear-time at() or operator[], and most of them hate iterators. People prefer ch = str[str.find('<p>') + 3] instead of ch = *std::next(str.find_iter('<p>'), 3). 133.130.103.130 (talk) 03:10, 9 June 2017 (UTC)
- First of all I strongly suspect that your users are actually looking for code units, or don't care because they are only checking for some ASCII characters. They may also want UTF-16 code units due to the usage of those on Windows. "Code points" are only useful if the program is using UTF-32 everywhere, which is a rather circular argument since the point of this is that they are not used. If you really want "characters" then you are going to have to deal with normalization and combining marks and all that. A cache of previous indexes to offsets would solve this problem and is very commonly done.
- And your example will work: adding 3 to the find of "<p>" will return a pointer to the character after the '>' just like you want. Spitzak (talk) 19:06, 9 June 2017 (UTC)
- This assumes that the language you're using has a mechanism to index a string via arbitrary code unit offset. Also, I have actually run into a case where I had to perform random access into a string: generating symbols from a string of legal characters. It was massively faster to convert the string into an array of one-character strings. -- Resuna (talk) 22:13, 30 January 2019 (UTC)
- Sorry, can you explain what "generating symbols from a string of legal characters" means? Were you randomly selecting characters from the string? Did you have to give them all equal weight? Does "legal" mean non-composite characters only? Still not very convinced. And all languages that store the string by code units provide a method to index the nth code unit; you seem to be confused by languages that attempt to mangle string data by converting it into another encoding, which is not relevant to this discussion because the offsets are now in the code units of that encoding. Spitzak (talk) 02:34, 31 January 2019 (UTC)
- I have a string containing a list of characters that are legal in a namespace (say digits, letters, and a set of special characters like "@$_"). I'm generating a symbol to be used as an identifier in that namespace, say by encoding a UUID into a minimal length string. So I have ValidCharacters = "01234...stuff..." and Len=ValidCharacters.length() ... and I'm looping over symbol = symbol + ValidCharacters[uuid%Len]; uuid = uuid/Len. Symbol generation was a bottleneck that was fixed by turning ValidCharacters into an array of single-character strings ValidCharacters = ["0","1","2",...] because using the language's own implementation of [] used the general case of finding the Nth character in any string which required walking the string (O(N)) while finding the Nth string in the array was O(1). -- Resuna (talk) 11:43, 10 April 2019 (UTC)
- All you have done is provide an example where [] should be in code units, as is strongly recommended and done by virtually every string processing library now. Idiot savants who think "characters" are important and make strlen and [] work in "characters" cause horrid performance with absolutely zero gain, as you have demonstrated. If (as I suspect) the set of "ValidCharacters" is ASCII-only, then you could use your algorithm unchanged and fast if [] was in offsets. If you are trying to produce 8-bit bytes in a 1-byte encoding, I STRONGLY recommend you rewrite your code to use an array of integers; relying on character set translation to preserve binary data like this is very sketchy. If you are using UTF-8 in your resulting identifier that is just stupid, because it would be less efficient than changing your encoding to only use ASCII (two ASCII characters have less overhead than one 2-byte UTF-8 character and thus can contain more binary information). Spitzak (talk) 15:31, 10 April 2019 (UTC)
- At no point was there even an implication that "..."[] and similar operations operated on "characters" rather than code points. On the contrary, all strings are Unicode, internally stored as UTF-8, indexed by code point, and enforced by the language definition. The result is to be inserted verbatim in a UTF-8 text file, as is the normal result of a gensym operation. -- Resuna (talk) 16:38, 23 April 2019 (UTC)
- Since you called your array "validCharacters" you seem to be under the impression that indexing a string should return a "character". If this set contains only ASCII then you could use a string in UTF-8. If it does not then your encoding is stupid, since it is much less efficient to use UTF-8 this way than to use ASCII only (two ASCII characters contain 14 bits of information; a single 2-byte UTF-8 character contains only 11 bits). Spitzak (talk) 23:16, 23 April 2019 (UTC)
- You're really nitpicking over the variable name being "validCharacters" instead of "validSymbols"?
- I have no control of the implementation of the Swift string library, but it didn't have the capability of specifying a UTF-16 or ASCII string, all strings are UTF-8 and all indexing on the string is by iterating over it. As I said, I *did* convert the string into an array, just as you suggested, using 1 character strings instead of integers to avoid cluttering the code with string-to-integer-to-string conversions. -- Resuna (talk) 18:10, 1 October 2019 (UTC)
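The symbol-generation loop described in this subthread can be sketched as below. This is a loose Python rendering, not the Swift code under discussion: the alphabet, `gensym`, and `decode_symbol` are all assumptions made for illustration, and Python strings already index in constant time, so the list only mirrors the "array of single-character strings" workaround rather than demonstrating the speedup.

```python
import string
import uuid

# Hypothetical alphabet: digits, letters, and the special characters "@$_".
VALID_CHARACTERS = list(string.digits + string.ascii_letters + "@$_")
BASE = len(VALID_CHARACTERS)


def gensym(value: int) -> str:
    """Encode a non-negative integer, least significant digit first."""
    if value == 0:
        return VALID_CHARACTERS[0]
    symbol = []
    while value > 0:
        value, digit = divmod(value, BASE)
        symbol.append(VALID_CHARACTERS[digit])  # O(1) lookup in the array
    return "".join(symbol)


def decode_symbol(symbol: str) -> int:
    """Inverse of gensym, included only to check the round trip."""
    return sum(VALID_CHARACTERS.index(ch) * BASE ** i for i, ch in enumerate(symbol))


identifier = uuid.uuid4().int  # 128-bit integer
symbol = gensym(identifier)
print(symbol, decode_symbol(symbol) == identifier)
```

Because every entry in this alphabet is ASCII, each symbol character is one byte in UTF-8, so code unit and code point offsets coincide here.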
- Also a software developer. I definitely agree. Unicode seems designed to maximize programming frustration, and this is only one more of the ways it does so.
- Jumping to the middle of a string is a pretty normal operation. Clean indexing is so important that even UTF-32 wasn't good enough. We had so many people tripping up on indexing vs. character length that I even wrote string handling libraries that manipulate sequences of 32-bit numbers where each number is a lookup key to one Unicode grapheme cluster.
- With a uniform-length encoding chopping a string into eighty-character chunks for a formatting program involves nothing more complicated than adding 80 to the index to find the next break. I like that. I like knowing how much memory I need to reserve for an 80-character string. Breaking on any index boundary never leaves something that's not a valid character on either side of the break, nor creates any characters that weren't in the original string, nor deletes characters from the string. I like that. When a user with an editor selects a region of text and does some operation, I can calculate from the page offset and the actual on-screen locations of the ends of the selected region exactly what characters will be affected, without crawling all the characters from the top of the document. That's pretty nice too. If I'm searching for a string of known length and a prefix search fails I can jump ahead by the search string length by just adding it to the index, instead of crawling ahead one codepoint at a time counting characters that I know damn well can never match. I like that. When there are multiple page-downs in the command buffer it's also kind of cool to be able to just do a multiplication to see how far to jump and handle all of them at once, instead of crawling over every last codepoint giving the user time to notice the lag.
- And the list goes on. There are just a ton of conveniences for a uniform length of the encoding. I've gone to a hell of a lot of trouble to provide uniform-length encoding to the guys in our shop, in spite of the need to handle at least unicode-8 (meaning, the set of unicode characters that can be represented in 8 codepoints or less). And it's helped cut down errors a lot. The ideal is that in the future, when libraries like this are built into something programmers can use, they should NEVER have to think about the length of an encoded character again. And that's important. 173.228.13.5 (talk) 18:26, 27 May 2022 (UTC)
Checking the end of a string not useful??
- However there are few (if any) useful algorithms that examine the n'th code point without first examining the preceding n-1 code points, therefore a somewhat more difficult code replacement will almost always negate this advantage
This is bull. I'm a senior software developer, and I can't even count how often I need to examine the last few characters at the *end* of a string (examples: Check if a path is a directory or a file; check a file's extension)
EDIT: Seeing how others have already had almost exactly the same complaint, I'll now remove / rephrase the mentioned section.
- You can pattern-match the end of a string by going *backwards* from the end, which does not require fixed-sized code units. The current text is correct, do not change it. Spitzak (talk) 00:34, 30 September 2016 (UTC)
This statement should be removed entirely. — Preceding unsigned comment added by 82.139.196.68 (talk) 09:43, 22 April 2012 (UTC)
- All you are saying is that an offset (such as the length) should be in code units, not "characters". You can find the end of a UTF-8 or UTF-16 string instantly if the length is in code units, negating any advantage of UTF-32. Anyway there is more text about this below, so this is ok. You should be warned however that thinking offsets must be in "characters" is not consistent with claiming to be some kind of expert. Spitzak (talk) 15:50, 23 April 2012 (UTC)
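The backward-from-the-end approach described in the replies above can be sketched on UTF-8 bytes as follows. It assumes valid UTF-8 input; `has_suffix` and `last_code_point` are names made up for this illustration.

```python
def has_suffix(path_utf8: bytes, suffix: str) -> bool:
    """Compare only the trailing code units; nothing before them is scanned."""
    return path_utf8.endswith(suffix.encode("utf-8"))


def last_code_point(data: bytes) -> str:
    """Step backwards over continuation bytes (10xxxxxx) to the last character."""
    i = len(data) - 1
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return data[i:].decode("utf-8")


path = "文書/résumé.txt".encode("utf-8")
print(has_suffix(path, ".txt"), has_suffix(path, "/"))
print(last_code_point("dir/日本".encode("utf-8")))
```

Both helpers touch at most a handful of bytes at the end of the buffer, given that the length is known in code units.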
Citation needed
There are two citation needed tags for statements saying something is rare to non-existent. In both cases, I don't see how anyone would find a reference. For one, you need a definition of rare to actually know, but also you don't know how many documents people have written and stored on their own computers. Unless someone decides to do a random survey of all documents, it isn't likely we will ever know. I think the tags should be removed. Gah4 (talk) 21:25, 5 October 2015 (UTC)
IPsock of SlitherioFan2016.
|
---|
0x0000D800 + 0x0000DC00 = 0x00010000?
On my computer, every time I type � (0x0000D800, a high surrogate) followed by � (0x0000DC00, a low surrogate), I get 𐀀 (0x00010000, a Linear B character). What causes this? 108.71.121.98 (talk) 22:02, 12 September 2016 (UTC)
UTF-32 vs UCS-4
Editor 108.66.232.129 has made several edits to indicate that numbers above 0x10FFFF can be stored in UTF-32. He did find some interesting documentation that indicates that the UCS designers did in fact make some assignments in this range, mostly blocking out some very large Private Use Areas. (Note that his reference looks legit, but I don't think he is translating it correctly to code points; the reference uses confusing block/slice terms rather than simple number ranges.) My impression however is that "UTF-32" means "32-bit code units but you are not allowed to set any of them to a number larger than 0x10FFFF". This is what distinguishes it from "UCS-4", which says you can put any value (or perhaps any non-negative value) in the code units. I have tried to fix his edits so that they imply that those large characters are part of UCS-4, but not of UTF-32. However he has pretty much reverted this each time. Any opinions on who is right and if or how this should be done? Spitzak (talk) 01:24, 15 September 2016 (UTC)
Maximum character?
Numbers above 0x0010FFFF can be stored in UTF-32. In fact, numbers above 0x7FFFFFFF can be stored in UTF-32. If you click this link, you can enter a number 0x00000000 to 0xFFFFFFFF, despite there only being 1114112 unicode characters. 108.66.232.44 (talk) 01:13, 21 September 2016 (UTC)
UTF-32/UCS-4
The Unicode standards say that UTF-32 is a subset of UCS-4...but they are identical to each other? How is that possible? 108.71.120.222 (talk) 16:42, 26 September 2016 (UTC)
|
Requested move 27 September 2016
- The following is a closed discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. Editors desiring to contest the closing decision should consider a move review. No further edits should be made to this section.
The result of the move request was: Not moved — Amakuru (talk) 10:34, 5 October 2016 (UTC)
UTF-32 → UTF-32/UCS-4 – Essentially the same encoding. 108.71.120.222 (talk) 12:28, 27 September 2016 (UTC)
- Oppose - do not see much need. UCS-4 (which redirects here and is no longer very notable) is just outdated/not used (strictly UTF-32 is also mostly not used, at least for strings). Do you want to do this to UTF-16 too? You could call that UTF-16/UCS-2 (while they are not the same, neither does it strictly apply to UCS-4), or maybe UTF-16/UTF-16LE? comp.arch (talk) 12:38, 27 September 2016 (UTC)
- No, I only want to do that to UTF-32. 108.71.120.222 (talk) 16:33, 27 September 2016 (UTC)
- Oppose - should stay consistent with UTF-16 page which is not called UTF-16/UCS-2. Spitzak (talk) 18:35, 27 September 2016 (UTC)
- Oppose per given UTF-16 reasoning. Would alternatively support a move to "UCS-4" if it's the WP:COMMONNAME, but surely it is not, right? (A small side reason is that the slash in the talk page (only) will be interpreted by MediaWiki as a sub-page when it probably is not. Can simply avoid that quirk/hassle by keeping the title as-is.) — Andy W. (talk · ctb) 00:03, 29 September 2016 (UTC)
- That's because UTF-16 and UCS-2 are separate encodings. UTF-32 and UCS-4 are essentially identical. 108.71.122.60 (talk) 12:20, 30 September 2016 (UTC)
- See MOS:SLASH. Would you support "UTF-32–UCS-4" (with an en-dash)? (not that I think I do at the moment) — Andy W. (talk · ctb) 14:14, 29 September 2016 (UTC)
- No. That would look bad. 108.71.121.95 (talk) 16:40, 29 September 2016 (UTC)
- It seems to me that this will fail per WP:SNOW, and it is unclear how UTF-16's relation to UCS-2 is different from UTF-32's relation to UCS-4 (ok, guess UTF-16 is variable length, but with graphemes UTF-32 is also kind of). comp.arch (talk) 17:09, 29 September 2016 (UTC)
- Support - the two encodings are identical. 108.71.122.60 (talk) 12:17, 30 September 2016 (UTC)
- Oppose. Nobody refers to UCS-4 nowadays, and it is sufficient that UCS-4 redirects here, and its equivalence is briefly mentioned in the lede. BabelStone (talk) 19:09, 30 September 2016 (UTC)
- Comment. Because the two encodings are identical, it should be moved to this, and UTF-32 should redirect here. 108.71.123.175 (talk) 00:59, 1 October 2016 (UTC)
- Oppose: A redirect from UCS-4 to UTF-32 is enough. The keywords for both are bolded in the article. I (and probably the general public) understand UTF-32 as a term much better than UCS-4. 80.221.159.67 (talk) 04:07, 1 October 2016 (UTC)
- Comment: UTF-32 and UCS-4 should redirect here because they are both the same encoding. 99.101.115.113 (talk) 19:18, 1 October 2016 (UTC)
- Oppose, largely for reasons already given, principally that the common name is UTF-32. Strictly speaking, they are not the same encoding: the space covered by UTF-32 is only the "21-bit" space of standard Unicode (so codepoints > U+10FFFF are invalid in UTF-32), and the surrogates are also invalid in UTF-32. UCS-4 is an obsolete version of the scheme which was later standardized as UTF-32. A note on that history here, and a redirect from UCS-4, suffices. -- Elphion (talk) 00:11, 3 October 2016 (UTC)
- Comment: UTF-32 and UCS-4 are now identical. They used to be different, but now they are the same. UTF-32 should redirect here, and therefore this page should be renamed UTF-32/UCS-4. 108.66.234.86 (talk) 12:43, 3 October 2016 (UTC)
- Look, you can't have it both ways. If they are now the same (citation needed), why do you say below that UCS-4 has private use areas beyond the range of UTF-32 (which is capped at U+10FFFF)? -- Elphion (talk) 13:06, 3 October 2016 (UTC)
- Because UTF-32 represents all UCS characters. 108.71.121.129 (talk) 16:57, 3 October 2016 (UTC)
- Again, if they are the same, then UCS-4 is limited (like UTF-32) to the space U+0000..U+10FFFF. Therefore, while there was once a private use area above that in the UCS standard, by the actions of both standards committees that area is now gone. No one is saying that codepoints above U+10FFFF can't technically be encoded by the coding scheme, but such codepoints are not just unassigned, they are invalid. Different applications might handle them or represent them in different ways, but according to both standards, they are invalid codepoints. -- Elphion (talk) 18:07, 3 October 2016 (UTC)
- The above discussion is preserved as an archive of a requested move. Please do not modify it. Subsequent comments should be made in a new section on this talk page or in a move review. No further edits should be made to this section.
IPsock of SlitherioFan2016.
|
---|
Private Use Planes/Groups
There are private use areas 0x00E00000 to 0x00FFFFFF and 0x60000000 to 0x7FFFFFFF. 108.66.232.212 (talk) 23:18, 1 October 2016 (UTC)
Why not UTF-21?
Why is there no UTF-21? It would go 0x000000 to 0x1FFFFF, more than enough to cover Unicode's limit of 0x0010FFFF. 108.71.120.246 (talk) 16:05, 2 October 2016 (UTC)
Assuming 8 bit byte oriented systems, there might be interest in UTF-24. But UTF-32 should compress fairly well using the common compression algorithms. Gah4 (talk) 01:05, 3 October 2016 (UTC)
UCS-2
If UCS-4 is a 31 bit encoding, then why is UCS-2 a 16 bit encoding? 108.71.120.246 (talk) 16:05, 2 October 2016 (UTC)
|
17-plane restriction
This is a probably futile attempt to clarify the situation regarding the 17-plane restriction of the Unicode and ISO/IEC 10646 standards, for the benefit of the IP editor who insists that this restriction does not exist.
ISO/IEC 10646 ("Universal Coded Character Set") was not originally restricted to seventeen planes, whereas the Unicode Standard was. This lack of synchronization between the two standards was problematic, and so in 2005 the US national body (essentially representing the Unicode Consortium) requested ISO/IEC JTC1/SC2/WG2 to change the wording of ISO/IEC 10646 to limit the code space to 17 planes (see WG2 N2920). This was discussed at the SC2/WG2 meeting held in Xiamen in January 2005 (which I attended, incidentally), and the US proposal was accepted, with Japan abstaining (see WG2 M46 Minutes pp. 50–51 and Resolution M46.12 (17-plane restriction)). The relevant changes were made in ISO/IEC 10646:2003 Amendment 2, which was published in 2006, and since that time ISO/IEC 10646 has been limited to 17 planes, and only code points in the range 0 through 10FFFF excluding D800–DFFF have been valid (see ISO/IEC 10646:2014 clauses 4.57, 4.58, 9.4 and elsewhere). Therefore, prior to 2006 UCS-4 was not equivalent to UTF-32, but since 2006 UCS-4 is identical to UTF-32 as it is restricted to the code space defined in ISO/IEC 10646. We should mention this historical discrepancy in the article, but there is no need to give it undue weight. BabelStone (talk) 16:09, 3 October 2016 (UTC)
- Unicode is a 32 bit code space. 108.71.121.129 (talk) 16:58, 3 October 2016 (UTC)
- See "65,536 plane restriction" below -- Elphion (talk) 15:09, 5 October 2016 (UTC)
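The validity rule discussed in this section (a code unit must lie in 0 through 10FFFF and must not fall in the surrogate range, which runs from D800 through DFFF) can be sketched roughly as below. This is only an illustration under those stated limits; the names `is_valid_utf32_unit` and `decode_utf32be_strict` are hypothetical and not taken from either standard.

```python
import struct

MAX_CODE_POINT = 0x10FFFF          # top of the 17-plane code space
SURROGATE_FIRST = 0xD800           # first surrogate code point
SURROGATE_LAST = 0xDFFF            # last surrogate code point


def is_valid_utf32_unit(value: int) -> bool:
    """True if value is in 0..10FFFF and is not a surrogate."""
    if not 0 <= value <= MAX_CODE_POINT:
        return False
    return not (SURROGATE_FIRST <= value <= SURROGATE_LAST)


def decode_utf32be_strict(data: bytes) -> list:
    """Split big-endian 32-bit code units and reject invalid ones."""
    if len(data) % 4 != 0:
        raise ValueError("length is not a multiple of 4 bytes")
    units = struct.unpack(f">{len(data) // 4}I", data)
    for unit in units:
        if not is_valid_utf32_unit(unit):
            raise ValueError(f"invalid UTF-32 code unit 0x{unit:08X}")
    return list(units)


# 17 planes of 65,536 code points each, minus the 2,048 surrogates.
scalar_count = (MAX_CODE_POINT + 1) - (SURROGATE_LAST - SURROGATE_FIRST + 1)
```

A value such as 0x60000000 fits in 32 bits, so it can be stored in a code unit, but a check of this kind rejects it, which is the distinction between "can be stored" and "is valid" made in the threads above.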
IPsock of SlitherioFan2016.
|
---|
65,536 plane restriction
Unicode is a 32 bit space with a 65,536 plane restriction, and sometimes those 17+ planes are used in UTF-32/UCS-4. 108.71.123.25 (talk) 14:28, 5 October 2016 (UTC)
Just to be clear...
The reason why Unicode has 65536 planes, is because some operating systems use the extra 11 bits for a secondary purpose. 108.66.233.160 (talk) 17:47, 8 February 2017 (UTC)
Because the first version put out by the Unicode Consortium was strictly a 2-byte code. The Chinese finally convinced them that this didn't allow for enough characters, so the surrogate method was invented to expand the character space. But the manufacturers in the Consortium flatly refused to move to a 4-byte space (the 2-byte representation was already too embedded in existing products), so Unicode was officially capped at what could be represented by UTF-16. That's the kind of tug-of-war you see with most standards: to become widely adopted, they have to satisfy requirements both from different user communities and also from manufacturers, so the final result may not be optimal from all viewpoints. -- Elphion (talk) 09:26, 12 February 2017 (UTC)
UTF-32 or UCS-4?
I know it's been said before, but UTF-32/UCS-4 actually does have those huge private use areas. In fact, I have a flip phone with them. 108.65.81.240 (talk) 01:05, 6 October 2016 (UTC) |
0x00000000
Seems to be many changes, and then reversions, related to 0 or 0x00000000. (None by me, but I have been following them.) In the case of a range, I prefer both ends of the range to have the same notation. In a table, including an initializer in the appropriate language, I prefer all entries to use the same notation. (So I would vote for the 0x00000000, but I won't enter a change war.) As I understand it, though, the U+ form is the Unicode way to write such values, and that would seem appropriate here. Gah4 (talk) 22:02, 11 October 2016 (UTC)
- I don't particularly mind either way, although I dislike a disruptive IP editor trying to impose their viewpoint on the article. However, U+ notation would be wrong in this case, as U+ notation is used to represent Unicode code points, and here we are talking about the hexadecimal values of those code points in a particular encoding form, which are two very different things. So we can say "the Unicode code point U+10FFFF is represented as 0x0010FFFF in UTF-32", but it would be wrong to say "the Unicode code point U+10FFFF is represented as U+0010FFFF in UTF-32". BabelStone (talk) 22:26, 11 October 2016 (UTC)
- Besides just saying it is wrong, can you make a convincing case for it being wrong? Note, for one, that as far as I understand, UTF-32 doesn't define endianness, so we are not discussing specific bit patterns. I suspect it doesn't even require a binary representation, though others are rare for computers today. Gah4 (talk) 00:02, 13 October 2016 (UTC)
- The U+ notation is specifically designed to represent Unicode code points, which are hexadecimal numbers in the range 0 through 10FFFF (see The Unicode Standard Appendix A: Notational Conventions). Unicode code points can be realised as different numbers under different Unicode transformation formats; so for example U+10FFFF is F4 8F BF BF in UTF-8, DBFF DFFF in UTF-16, and 0010FFFF in UTF-32. These values, whether expressed in hexadecimal or decimal, are not Unicode code points, and should not be written using the U+ notation. BabelStone (talk) 11:09, 13 October 2016 (UTC)
- I agree: U+ notation is appropriate for code points, not for generic hex values. However, I think "0" suffices for 0 in text, since it is significantly easier to read. 0x0... is fine for tables. -- Elphion (talk) 22:48, 11 October 2016 (UTC)
- Can we have more discussion on this? It doesn't seem to have gone very far before the conclusion appears. Gah4 (talk) 14:05, 12 October 2016 (UTC)
- I think that 0x00000000 is better, because it's easier just for all hex entries to use the same notation. 108.65.82.240 (talk) 19:39, 12 October 2016 (UTC)
- I tried asking on WT:MOS, but there is no interest there in helping. Seems it isn't a common enough problem. I am not sure if there is an official way, but I suggest people explain here their choice, and the reasons behind it. After enough such posts, we can form a consensus. Seems to me that so far, people are claiming a consensus with one vote. Gah4 (talk) 00:02, 13 October 2016 (UTC)
- I think it is very simple: U+ notation for Unicode code points, and 0x notation for hexadecimal values of Unicode code points in a particular Unicode transformation format. You say that there is no consensus not to use U+ for hexadecimal values, but when you make a change to existing text the burden is on you to get consensus for the new text, and there is no consensus for your proposed change to use U+ notation in a manner for which it was not intended. BabelStone (talk) 11:16, 13 October 2016 (UTC)
- That is true, but there seems to be no consensus for the other choices, either. I posted here days before I made the change to give people a chance to comment on reasons for or against. No reasons were given.
- So I also ask for discussion for or against any of the other suggested notations, and no reasons were given.
- I went to ask in WT:MOS, when nobody else did. But okay, I see you now did comment on U+, and I will think about that one. But still no reasons for/against the other suggested notations. Gah4 (talk) 13:03, 13 October 2016 (UTC)
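The distinction drawn in this thread between a code point (U+ notation) and its hexadecimal values in each encoding form can be checked with a short sketch. It uses Python's built-in codecs; the variable names are just for this illustration, and the explicit `-be`/`-le` codec names are chosen so that no byte order mark is emitted.

```python
cp = 0x10FFFF                      # the code point, written U+10FFFF
ch = chr(cp)

notation = f"U+{cp:04X}"           # U+ notation names the code point itself

# Hex values of the same code point under each encoding form.
utf8 = ch.encode("utf-8").hex(" ").upper()            # one byte per group
utf16 = ch.encode("utf-16-be").hex(" ", 2).upper()    # one 16-bit unit per group
utf32_be = ch.encode("utf-32-be").hex().upper()       # one 32-bit unit
utf32_le = ch.encode("utf-32-le").hex(" ").upper()    # same unit, bytes reversed

print(notation, "->", utf8, "|", utf16, "|", utf32_be)
```

The last two values also bear on the endianness question raised above: the 32-bit code unit is the same number in both cases, and only the serialized byte order differs.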
IPsock of SlitherioFan2016.
|
---|
0 or 0x0000 or 0x000000 or 0x00000000?
Which notation should be used in this page? 108.65.81.159 (talk) 16:41, 13 October 2016 (UTC)
Nobody's saying much because we said it above. I'll add that including all 4 bytes is completely unnecessary: since we're talking about 32-bit values, it is obviously a 32-bit number whichever format is used. I still find 0 much easier to read: all the zeros just stick in the eye. -- Elphion (talk) 00:42, 15 November 2016 (UTC)
|
References
- ^ "The Unicode Standard, Appendix A" (PDF). www.unicode.org. Unicode, Inc. Retrieved 15 November 2016.
- ^ "Contact Form". www.unicode.org. Unicode, Inc. Retrieved 15 November 2016.
“This removes any speed advantage of UTF-32”
The statement “This removes any speed advantage of UTF-32” has been flagged as needing a reference for three months now. But even worse, it doesn't make sense, at least not to me. Unless someone can at least explain what it means, we should therefore just remove it. ◅ Sebastian 04:43, 11 August 2023 (UTC)
- It means an integer offset measured in code units into a variable-width encoding such as UTF-8 is just as fast as an integer offset measured in code points into a UTF-32 string. There are numerous attempts to explain this in the preceding sentences. Spitzak (talk) 14:32, 21 August 2023 (UTC)
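A rough sketch of that claim, assuming the text is available both as UTF-8 and as UTF-32 bytes: an offset kept in code units (bytes for UTF-8) indexes the buffer directly, just as a code point offset does for UTF-32. The sample string and variable names are invented for this illustration.

```python
text = "naïve café 𐀀 end"
needle = "café"

utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")   # explicit byte order, so no BOM

# Offset in UTF-8 code units (bytes), found by a plain byte search.
byte_off = utf8.find(needle.encode("utf-8"))
# Offset in code points, which is also the UTF-32 code unit index.
cp_off = text.find(needle)

# Both offsets give direct access with one slice; neither needs a scan
# over the preceding characters once the offset is known.
needle_len_utf8 = len(needle.encode("utf-8"))
from_utf8 = utf8[byte_off:byte_off + needle_len_utf8].decode("utf-8")
from_utf32 = utf32[4 * cp_off:4 * (cp_off + len(needle))].decode("utf-32-le")
```

The two offsets differ numerically because "ï" takes two bytes in UTF-8, but each is equally cheap to use with its own encoding.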