
Talk:Unicode in Microsoft Windows

From Wikipedia, the free encyclopedia

Untitled


Much of the last (utf-8) paragraph is babble. One does not require utf8 support from the OS when there is utf16 support, since the conversions between utf8 and utf16 are very simple and mechanical and do not require large tables (as other unicode functionality does). 88.159.79.148 (talk) 17:39, 6 February 2016 (UTC)
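To illustrate the "simple and mechanical" claim above, here is a minimal Python sketch of a UTF-8 to UTF-16 conversion that uses only bit arithmetic and no lookup tables. The function name `utf8_to_utf16_units` is hypothetical, and the sketch assumes well-formed input: it does not reject overlong forms or encoded surrogates, which a production converter would need to do.

```python
def utf8_to_utf16_units(data):
    """Convert UTF-8 bytes to a list of UTF-16 code units using bit arithmetic only."""
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte tells us how many continuation bytes follow.
        if lead < 0x80:
            cp, extra = lead, 0
        elif lead >> 5 == 0b110:
            cp, extra = lead & 0x1F, 1
        elif lead >> 4 == 0b1110:
            cp, extra = lead & 0x0F, 2
        elif lead >> 3 == 0b11110:
            cp, extra = lead & 0x07, 3
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + extra >= len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        # Each continuation byte contributes its low 6 bits.
        for b in data[i + 1:i + 1 + extra]:
            if b & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (b & 0x3F)
        i += extra + 1
        if cp >= 0x10000:
            # Code points above the BMP become a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
        else:
            units.append(cp)
    return units


print([hex(u) for u in utf8_to_utf16_units("\u00e9\U0001F600".encode("utf-8"))])
# ['0xe9', '0xd83d', '0xde00']
```

The reverse direction is equally mechanical: combine surrogate pairs back into a code point, then emit one to four bytes depending on its magnitude.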

fopen("string",...) does not work and cannot open all possible files, because UTF-8 conversion is not done. This is a violation of the POSIX and C99 standards. Windows is broken, stop trying to claim otherwise. Yes, you can work around it by converting the strings to UTF-16 and using the Windows-specific API, but it is broken in that their standard C library does not do this. Spitzak (talk) 02:15, 9 February 2016 (UTC)

Yes, chcp 65001 is a thing


Assuming you can get your hands on Windows 10, get an Ubuntu or any other WSL system from the store. Run it, and you will see that conhost reports cp65001 in the window's properties.

WSL has a Binfmt_misc hook that lets the Win32 part run exe files, inheriting many of WSL's settings. One of these settings is the code page, and it causes bugs in old Python2 versions because Python2 does not recognize the 65001 code page that Windows reports.

If you read the workarounds in the bug, you will see that chcp 850 is used to switch to an encoding that Python2 understands, and chcp 65001 is used to switch it back afterwards. The full commands include /mnt/c/Windows/System32/cmd.exe /C, because that's how you point to cmd under WSL.

And yes, you can reproduce that without WSL. Open up cmd in Windows 10 and install Python 3.6, and you can:

C:\Python\Python36>chcp 437
Active code page: 437

C:\Python\Python36>set PYTHONLEGACYWINDOWSSTDIO=1

C:\Python\Python36>python -c print(__import__('sys').stdout.encoding)
cp437

C:\Python\Python36>chcp 65001
Active code page: 65001

C:\Python\Python36>python -c print(__import__('sys').stdout.encoding)
cp65001

PYTHONLEGACYWINDOWSSTDIO is needed to force Python to use the local code page because of PEP-0528, which uses "utf-8" by default. Before setting the variable, Python 3.6 will always report "utf-8".

--Artoria2e5 contrib 16:24, 9 May 2018 (UTC)

Does this actually change what fopen() does as it translates bytes to UTF-16? If not then it is irrelevant to this discussion. It sounds like it leaves stuff in the environment so that Python acts differently. It may also be changing how the wrapper makes argv/argc from the typed command. Spitzak (talk) 00:51, 10 May 2018 (UTC)
Regarding non-double-byte MBCSes: there is another four-byte-at-maximum code page in Windows called cp54936 (GB 18030). Like UTF-8, it too cannot be used for the locale or "ANSI" code page. In fact all the locale MBCS code pages are DBCS, so the likely explanation is that many programs simply cannot handle three or more bytes. --Artoria2e5 contrib 16:28, 9 May 2018 (UTC)
That might be the excuse, but I think a lot of people vastly overestimate how important it is to get the variable-width encoding right. Your example of checking a prefix on a string will not fail even if the prefix contains 3- or 4-byte characters or if the tested string contains them, even if they straddle the end of the prefix length and are thus "cut in half", which is often described as some horrible dangerous result that will cause your computer to catch fire. That is wrong: all that will happen is that it will return the correct answer that the prefix does not match. It really is vastly simpler than people seem to think, and paranoia is stopping I18N from being solved simply. It is true that *old* encodings were not self-synchronizing, which is the main property needed to allow algorithms to work without knowing anything about the encoding. But UTF-8 and UTF-16 do not have that mistake. Spitzak (talk) 00:51, 10 May 2018 (UTC)
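The prefix argument above can be sketched in a few lines of Python. The helper `has_prefix` is a hypothetical name; it compares raw bytes and knows nothing about UTF-8 character boundaries. The second test string is chosen so that slicing it to the prefix length cuts a three-byte character in half.

```python
def has_prefix(text, prefix):
    """Byte-wise prefix check with no knowledge of the encoding."""
    return text[:len(prefix)] == prefix


prefix = "a\u00e9".encode("utf-8")             # 3 bytes: 61 C3 A9
matching = "a\u00e9\u65e5".encode("utf-8")     # starts with the same 3 bytes
other = "a\u65e5\u672c".encode("utf-8")        # 61 E6 97 A5 ...: the slice splits U+65E5

print(has_prefix(matching, prefix))  # True
print(has_prefix(other, prefix))     # False
```

The "cut in half" slice is never decoded or displayed; it is only compared, and because a UTF-8 lead byte can never equal a continuation byte, the comparison cannot produce a false match.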

I still find this explanation really dubious. A couple of questions:

Does this really have the ability to make fopen() accept UTF-8 and translate it to the correct UTF-16 filename? If that is not true, this command is mostly irrelevant for this discussion.
The description of a "manual command" (meaning one the user types???) that "changes the current process" (which is the shell if the user types the command, not any launched program) seems very dubious. I think this actually changes the environment temporarily for a launched program. In this case, is it really true that there is no way for a program to change its own environment the same way? This seems really unlikely, so I am strongly suspicious that this does NOT alter fopen, and may instead have something to do with the interpretation of argv/argc from the command line, which *is* done by the shell; it might then make sense that only the manual command works and that it changes the current process.

Answers from Windows experts welcome! Spitzak (talk) 18:56, 10 May 2018 (UTC)

I have done a bunch of searching. It appears that chcp changes the interpretation of byte data streams by the shell. This covers stdin/stdout and the reading of batch files. It does not change fopen. It must also set some environment variable that causes Python to barf; that is not a Windows bug but a Python one. Will try to correct the text. Spitzak (talk) 19:28, 10 May 2018 (UTC)

"interpretation of byte data streams by the shell": chcp changes things for the entire console. (I was wrong about the locale thing.) Many programs, like cmd, want to be consistent with the input/output, so they follow the code page too.
"it must set some environment variable that causes Python to barf": Well, it is because on the newer, immune-to-code-page-shit Python 3.6, an env var is necessary to restore the old behavior seen in Python 2 and Python 3.5. (READ THE PEP DOCUMENT!) (It is not needed on older Python versions, but why would I install one on my computer?) The old behavior involves asking Windows for the code page.
Okay, I should probably explain what a PEP is. A PEP (Python Enhancement Proposal) is a document that proposes a change to Python. The PEP-0528 change is applied to Python 3.6+, and I need to set the variable to make Python act with the change disabled. For Python versions lower than 3.6 I do not need to set a variable, because using the code page was the default.
"changes the environment temporarily": The "environment" in question is the code page of the console, not some environment variable. It is acquired by Python 2 from GetConsoleCP; Python then tries to convert its unicode strings to bytes in that code page for proper writing.
"for a program to change its own environment the same way": It exists. The console (chcp) part is "SetConsoleCP". The actual locale change can be done by a setlocale(LC_ALL, "en_US.65001");, where 65001 is a code page number. (Of course, the console change propagates to any parent/child programs using that console.)
"Does this really have the ability to make fopen() accept UTF-8": chcp probably does not, as the "ANSI" API depends on the locale more than the console code page. The control panel setting does, because it is changing what "ANSI" is.
"something to do with the interpretation of argv/argc from the command line": If you go to the failed part of docopt mentioned in the GitHub issue (https://github.com/docopt/docopt/blob/6879155/docopt.py#L478), you will see that the error occurred at a print() invocation. I explained how that error happens in "changes the environment temporarily". --Artoria2e5 contrib 19:54, 10 May 2018 (UTC)
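The failure mode described above (a writer that converts unicode strings to bytes in whatever code page the console reports) can be sketched as follows. This is only an illustration, not Python's actual console code: `try_encode` is a hypothetical helper, and the codec name `x-hypothetical-codec` is made up to stand in for a code page name the interpreter does not recognize, as old Python 2 did not recognize cp65001.

```python
def try_encode(text, codec_name):
    """Encode text for a console that uses codec_name; report why it fails, if it does."""
    try:
        return text.encode(codec_name)
    except LookupError:
        # The interpreter has no codec registered under this name.
        return "unknown codec"
    except UnicodeEncodeError:
        # The codec exists but has no byte sequence for some character.
        return "unencodable"


print(try_encode("caf\u00e9", "cp437"))                 # b'caf\x82'
print(try_encode("caf\u00e9", "utf-8"))                 # b'caf\xc3\xa9'
print(try_encode("\u65e5", "cp437"))                    # unencodable
print(try_encode("caf\u00e9", "x-hypothetical-codec"))  # unknown codec
```

Switching the console to a code page the interpreter knows (the chcp 850 workaround) avoids the "unknown codec" case but reintroduces the "unencodable" one for characters outside that code page.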

mbsrev


@Spitzak: What version of the Microsoft C runtime is your copy of mbsrev from? This might be due to the DBCS assumption I was talking about, but I guess MS should have fixed it in a newer version of their C runtime. (Otherwise they wouldn't be confident enough to put that UTF-8 as ANSI "beta" feature in.) --Artoria2e5 contrib 19:59, 10 May 2018 (UTC)

They may very well have a working mbsrev for UTF-8. What I was looking for was which functions are changed to mbs versions by the "mbs" switch to the compiler. Functions that fail for single bytes but work for multiple bytes are a good indication that they may fail for UTF-8. Basically, if there is no mbs version, then it is not possible for the function to fail for UTF-8 any worse than it fails for any other MBCS. As I suspected, the functions that can fail are very limited and mostly useless (really, who needs to reverse a string?). Spitzak (talk) 01:57, 11 May 2018 (UTC)
What strange reasoning is this? It doesn't matter who needs to reverse strings. A real-world program is made up of a lot of libraries or external code. If you change the assumptions, code that you are unaware of may break. There is already enough broken code on the web with badly displayed UTF-8. You might be less aware of this if you don't use characters outside of US-ASCII, but in the real world it happens a lot (too much). Correctness is important; it is not a question of frequency or likelihood. Preceding unsigned comment added by 88.219.179.82 (talk) 00:56, 19 August 2018 (UTC)
Please show me an example of "broken code on the web with badly-displayed UTF-8". I think you have examples of confusion as to what encoding is used, not examples of incorrect handling of UTF-8. Spitzak (talk) 16:33, 20 August 2018 (UTC)

See also?


I don't see the purpose of the 'Bush hid the facts' link other than to inject pointless politics into this page. It adds nothing that the article itself and the 'Mojibake' article do not cover without distraction. I vote for removal. Mespinola (talk) 01:35, 10 August 2020 (UTC)

UTF-8, UTF-16 and UCS-2 in Windows, and Microsoft products in general


First, I believe some of Microsoft's docs are outdated, despite carrying dates such as "02/28/2023" (likely the last-update date, not the original publication date). For example: "Currently, the only Unicode encoding that ODBC supports is UCS-2"[1]

UCS-2 is obsolete, so either ODBC, surprisingly, does not yet support UTF-16, or "UCS-2" should be read as "UTF-16".

It seems clear to me Microsoft recommends UTF-8 (given their doc page title "Use UTF-8 code pages in Windows apps"), and that recommendation was moved to main text, but if it's correct, then the lead should have it too, to summarize.

"Until recently, Windows has emphasized "Unicode" -W variants over -A APIs. However, recent releases have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps." That means UTF-16 (the -W API) is no longer emphasized. The point is not merely that UTF-8 is recommended over other 8-bit encodings, but that it is recommended over the now de-emphasized UTF-16.

"Win32 APIs often support both -A and -W variants." So it's clear that -A is part of the API; not always, but seemingly for all non-legacy APIs: "MultiByteToWideChar and WideCharToMultiByte let you convert between UTF-8 and UTF-16 (WCHAR) (and other code pages). This is particularly useful when a legacy Win32 API might only understand WCHAR. These functions allow you to convert UTF-8 input to WCHAR to pass into a -W API and then convert any results back if necessary." comp.arch (talk) 15:57, 13 March 2023 (UTC)

I'm sorry, but I'm going to need a document from Microsoft that says "stop using the W API and start using the A API" before I believe they really are recommending UTF-8 over UTF-16. All the linked documents only say that they prefer UTF-8 for the A API, but say absolutely nothing about the W API, which is what is used for UTF-16. Spitzak (talk) 16:06, 13 March 2023 (UTC)
So let's look at the "Use UTF-8 code pages in Windows apps" page.
The title just says that. It doesn't say "Use UTF-8 code pages, rather than UTF-16, in Windows apps"; inferring that's what it means is, well, making an inference, and inferences are not guaranteed to be true.
The first sentence of the first paragraph says "Use UTF-8 character encoding for optimal compatibility between web apps and other *nix-based platforms (Unix, Linux, and variants), minimize localization bugs, and reduce testing overhead." This, again, doesn't explicitly say you should use it instead of using the W APIs and UTF-16. It could also be read as saying "use UTF-8, rather than various non-Unicode local code pages, in cases where you're not using the W APIs" - I'm not sure how UTF-8 would minimize localization bugs and reduce testing overhead relative to UTF-16, but I can see how it would do that relative to local code pages (for example, not having to test with code pages other than 65001 means not having to do as much testing).
The "Set a process code page to UTF-8" section says "As of Windows Version 1903 (May 2019 Update), you can use the ActiveCodePage property in the appxmanifest for packaged apps, or the fusion manifest for unpackaged apps, to force a process to use UTF-8 as the process code page. ... With a minimum target version of Windows Version 1903, the process code page will always be UTF-8 so legacy code page detection and conversion can be avoided." - emphasis mine. "Use UTF-8 code pages" doesn't help if you have software that you want to be able to run on versions of Windows where the support for code page 65001 isn't quite as functional as you need (I'm a core developer for one such application).
Microsoft have been fixing problems with code page 65001 support in their libraries and apps over time, but it would be a Good Thing if they'd clarify which OS release and build is the first one where 1) you can set the code page to 65001 in an app and expect the Microsoft A APIs to Just Work and 2) you can set the code page to 65001 from the command line and expect the command line to work. See "Create a UTF8 C-runtime" in the Visual Studio Developer Community site; it mentions some issues, including issues with console output and command-line arguments, with the UTF-8 code page support in the Visual Studio C library.
I'd say that "force the app code page to be 65001 and use the A APIs" is a good recommendation when you can do that and have it work on all currently-supported versions of Windows, in a fashion that a UN*X app written for a UTF-8 encoding, with non-character-encoding-related "make this work on Windows" changes made, will Just Work on Windows, including both GUI and console input and output. Guy Harris (talk) 22:53, 13 March 2023 (UTC)