Talk:C0 and C1 control codes
This article is rated List-class on Wikipedia's content assessment scale.
does anyone here have access to ISO/IEC-6429
and if so, can they check the codes in the C1 table (particularly the 3 not identified by Unicode) against it? Plugwash 02:34, 23 January 2006 (UTC)
- ECMA-48, the European version of this standard, is available online. --Random832 23:32, 1 July 2007 (UTC)
- Supposedly ECMA-48 is identical (and is available for free). The ISO (and ANSI) documents all cost money. Tedickey (talk) 10:23, 10 March 2008 (UTC)
2024
- What are "the 3 not identified by Unicode"? The Unicode 15.1 version of the Unicode chart of C1 controls and Latin-1 Supplement, and the 1992 version of ISO/IEC 6429, have the same set of C1 controls, except that Unicode has 0x84 as IND and ISO/IEC 6429 doesn't, but, as the note attached to IND says, it was "Deprecated in 1988 and withdrawn in 1992 from ISO/IEC 6429 (1986 and 1991 respectively for ECMA-48)". I'll attach references in response to the "[citation needed]" for that.
- Otherwise, the table matches both that version of Unicode and that version of ISO/IEC 6429. Guy Harris (talk) 09:56, 29 May 2024 (UTC)
- 0x80, 0x81, and 0x99. Search below for "Notes Regarding Omissions" Spitzak (talk) 18:52, 29 May 2024 (UTC)
- OK, those aren't mentioned in ISO/IEC 6429 or ECMA-48, either; the notes in question say they were proposed for ISO 10646, but not accepted. Guy Harris (talk) 19:14, 29 May 2024 (UTC)
Is "String Terminator" abbreviated "SI"?
Control code 0x9C is listed as:
0x9C SI ST String Terminator
However, SI is the abbreviation for:
0x0F SI Shift In
Is the SI in String Terminator supposed to be ST?
24.234.114.35 21:34, 4 May 2007 (UTC)
- Fixed, source RFC 1345 says ST. --217.184.142.52 (talk) 19:52, 16 June 2008 (UTC)
C1 not derived from/used in ISO/IEC 8859-n
The C1 codes were included in the ISO-8859-n series of encodings [...].
I think this is wrong if ISO-8859-n means ISO/IEC 8859. I only have access to draft versions of ISO/IEC 8859, but they explicitly say that the use of the C1 code points "is outside the scope of ISO/IEC 8859; it is specified in other International Standards, for example ISO/IEC 6429"; see here. --Abdull 08:10, 8 June 2007 (UTC)
- There is a subtle but important difference between ISO/IEC 8859-1 and the IANA charset ISO-8859-1. One is an incomplete standard without control codes; the other adds them in to make a usable standard. Plugwash 21:42, 1 July 2007 (UTC)
2024
- The Unicode standard claims that code points 0x00 through 0xFF are inherited from ISO 8859-1 (not from any IANA character set), but the Unicode standard is making a false claim there; the non-draft ISO/IEC 8859-1:1998 explicitly declares all the control character code points to be out of its scope. I've updated the page to indicate where Unicode really got code points 0x00-0x1F, 0x7F, and 0x80-0x9F. Guy Harris (talk) 0:53, 29 May 2024
- And Unicode doesn't describe what most of them do; see section 23.1 "Control Codes" of the Unicode 15.0 specification. Guy Harris (talk) 00:31, 30 May 2024 (UTC)
- A reference to back up where Unicode got the code points from would be nice. DRMcCreedy (talk) 00:45, 30 May 2024 (UTC)
- "Got [them] from" in what sense? Reserving 0x00-0x1F and 0x80-0x9F for the C0 and C1 control characters, respectively, came from ISO 2022. The semantics for the few control codes to which semantics are assigned, and the character name aliases, came from ISO 6429. C0 and C1 control codes § Unicode uses section 23.1 "Control Codes" of the Unicode specification as a reference. Guy Harris (talk) 01:06, 30 May 2024 (UTC)
- The scope of ISO 8859 is, indeed, to define specific graphical character sets for use with level 1 of ISO 4873. ISO 4873, in turn, is a subset of ISO 2022. Hence, the concept of the C0 and C1 controls is defined by standards that ISO 8859 is designed to conform to. Notably, Unicode itself does not conform to either of those standards.
- The relevance of ISO 8859 isn't that ISO 8859 itself defines anything to do with control codes (it doesn't, as you correctly observe), but that Unicode finished up with the C0 and C1 control codes (despite not itself being ISO-2022-based) on account of starting off by stipulating that existing data conforming to ISO 8859-1 (which would usually use some control code set, although it would still be conformant ISO 8859-1 if it didn't) should be mapped directly to U+0000–U+00FF. It so happens that Unicode continued to be used with a subset of the C0 set from ISO 6429 (i.e. using LF or CR+LF, as opposed to Unicode's own LSEP, as the end-of-line convention), and the likes of the Unicode Line Breaking Algorithm reflect this established practice.
- It is certainly true that the control codes did not originate on account of ISO 8859, and that one would be unsuccessful trying to look for information about them in ISO 8859 itself.
- --HarJIT (talk) 15:26, 1 June 2024 (UTC)
- The 30 May 2024 edit removed the verbiage that I felt needed a reference ("Unicode inherits code points 0x00-0x1F and 0x80-0x9F from ISO/IEC 6429:1992") so my comment/request is now moot. DRMcCreedy (talk) 15:52, 1 June 2024 (UTC)
CUA stuff
A few of the entries describe the use of a control key as a shortcut in many Windows programs and CUA X11 programs. For example: "In many programs, a keyboard input of Ctrl-Y is a "redo" command to undo the last Ctrl-Z undo command."
That's true, but the fact that Microsoft, when porting their Office software from the Mac to their own OS, used control keystrokes as a substitute for the missing command key has nothing to do with the meaning of any control character as a C0 control code.
Even if I'm completely wrong, I can't imagine how the undo/redo meanings of ^Z/^Y could be relevant while the clipboard meanings of ^X/^C/^V, the file command meanings of ^N/^O/^S, the select-all meaning of ^A, the find-related meanings of ^F/^G/^R, etc. are not. --75.36.140.83 07:36, 24 September 2007 (UTC)
- That stuff appears to have been removed. Guy Harris (talk) 01:08, 30 May 2024 (UTC)
RFC 1345
Do we really need to include the RFC 1345 acronyms? Aside from some limited usage in a UNIX utility, I haven't come across any evidence that they saw use elsewhere. Caerwine Caer’s whines 22:32, 16 June 2008 (UTC)
- I'd tend to agree - though deciding whether to remove them would take some investigation Tedickey (talk) 00:43, 17 June 2008 (UTC)
Backspace
The comments about backspace, and its linked topic, do not mention its use for underlining and bold. The comment in the table is rather crowded, but rather than a blanket "deprecated", the point should be made that while composition of characters is not generally supported in terminals, underline and bold generally are. Tedickey (talk) 12:19, 19 June 2008 (UTC)
I think the description of Backspace is incorrect. This character does not have different uses for input and output (the same is true of the CR or ESC characters, for example): it always moves the cursor leftwards, so the phrase "To provide disambiguation between the two potential uses of backspace" makes no sense.
A more precise description could be one in the same style of CR or ESC characters, for example:
Move the cursor one position leftwards. The Backspace key on a keyboard will send this character that is usually used to delete the character to the left of the cursor; to do that the three character sequence BS SPACE BS (0x08 0x20 0x08) is used. In early computer technology, where a character once printed could not be erased, the backspace was sometimes used to generate combinations of two characters, like à that could be produced using the three character sequence a BS ` (0x61 0x08 0x60), the method to print underline or overstrike characters combining _ or - with any character, or the standard method in APL programming language to create new operators combining two existing operators, like / BS - Aacini (talk) 05:35, 2 November 2008 (UTC)
- agree Tedickey (talk) 18:44, 2 November 2008 (UTC)
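The byte sequences quoted in the proposed Backspace description can be sketched in a few lines of Python. This is only an illustration of that description: the helper names (`overstrike`, `underline`, `strikes_per_column`) are made up for this example, and the "printing terminal" is a toy model that assumes BS moves one column left and that a space advances without striking anything.

```python
BS = "\x08"  # Backspace, C0 control code 0x08

# The erase idiom from the description: move left, overwrite with space, move left.
ERASE_PREVIOUS = b"\x08\x20\x08"


def overstrike(first: str, second: str) -> str:
    """Return the sequence that prints `second` on top of `first`."""
    return first + BS + second


def underline(text: str) -> str:
    """Underline each character by striking '_' and the character in one column."""
    return "".join("_" + BS + ch for ch in text)


def strikes_per_column(stream: str) -> list[list[str]]:
    """Toy model of a printing terminal: list the characters struck in each column."""
    columns: list[list[str]] = []
    cursor = 0
    for ch in stream:
        if ch == BS:
            cursor = max(0, cursor - 1)  # move left, never past column 0
            continue
        while len(columns) <= cursor:
            columns.append([])
        if ch != " ":  # assumption: space only advances the print position
            columns[cursor].append(ch)
        cursor += 1
    return columns
```

On paper nothing can be erased, so both characters remain in the same column; on a display terminal the second character would normally replace the first, which is why BS SPACE BS works as an erase sequence there.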
This article is not about all control characters
Just a friendly reminder. This article is not about every possible usage of a control character, nor even about usage on every system where 0x00–0x1F are control characters. This is about a specific set of control characters, the C0 and C1 sets as defined by ISO/IEC 2022. Some of those meanings are generalized, so while instances where an application or system further defines their usage are relevant, a use which is totally unrelated to the character as defined in ISO/IEC 2022 belongs in either a separate article or in control character. Caerwine Caer’s whines 02:58, 12 July 2008 (UTC)
unclear lines
The section C1 (ISO 8859 and Unicode) will become clearer if "if being used in an environment where 8-bit characters are not supported or where these octets are being used instead to add additional graphics characters" is removed. Also, I have moved a '+' outside the parentheses in a table column label. —Preceding unsigned comment added by 122.169.5.54 (talk) 08:46, 12 January 2010 (UTC)
- The sentence could be broken up, but removing it would lose the hint for why 7-bit controls are useful. (Sending 2 bytes instead of 1 is not necessarily a good thing). Tedickey (talk) 09:33, 12 January 2010 (UTC)
C1 (ISO 8859 and Unicode)
I renamed the heading "C1 (ISO 8859 and Unicode)" as "C1 set" since C1 is not defined in either ISO 8859 or Unicode. C0 and C1 can be used in ISO 8859 or Unicode text, but they don't define C0 or C1. — Preceding unsigned comment added by 88.112.175.168 (talk) 10:06, 27 September 2011 (UTC)
- And so what is «C0 Controls and Basic Latin» and «C1 Controls and Latin-1 Supplement» in Unicode standard?
- http://www.unicode.org/charts/PDF/U0000.pdf
- http://www.unicode.org/charts/PDF/U0080.pdf — Preceding unsigned comment added by 84.97.14.22 (talk) 06:27, 19 July 2012 (UTC)
- ECMA-35 and ECMA-48 define the use of C0/C1 for ISO-8859-1. Without a document such as that for Unicode (or UTF-8), all the documents that you have mentioned do is to show pictures of the codes that are mapped from ISO-8859-1; the C0/C1 behavior has not been specified. A reliable source on the matter would not leave leeway for guessing what might be meant TEDickey (talk) 08:16, 19 July 2012 (UTC)
- I just want to say that the Unicode standard
- recognizes those values as control characters,
- gives their range and aliases,
- as characters, implicitly attributes them a byte sequence depending on the UTF in use.
- Maybe you just want to say that Unicode does not specify the exact behavior of each control character.
- Additionally, a link can be established to Unicode control characters.
- In The Unicode Standard, Version 6.1 page 23, they say: Basic Type control is «Usage defined by protocols or standards outside the Unicode Standard», and classifies them as category Cc with status abstract character.
- And they add «Control Codes. Sixty-five code points (U+0000..U+001F and U+007F..U+009F) are defined specifically as control codes, for compatibility with the C0 and C1 control codes of the ISO/IEC 2022 framework. A few of these control codes are given specific interpretations by the Unicode Standard. (See Section 16.1, Control Codes.)»
- §16.1 is in page 544 for C0.
- In page 545 an additional semantic is clarified for at least eleven of them «Specification of Control Code Semantics» — Preceding unsigned comment added by 84.97.14.22 (talk) 11:18, 19 July 2012 (UTC)
- But that's the point: the paragraph as written states that Unicode "provides" these codes, but it is in a context (and no clarification is made there) to point out that Unicode provides no definition of their behavior. The C1 codes without being translated would be illegal in UTF-8 encoding (because the values in 128-159 are continuation bytes). Without clarification, the paragraph is misleading. The word "provides" is inappropriate in this context - "assigns" would be more idiomatic, and corresponds to the sources you indicate TEDickey (talk) 22:32, 19 July 2012 (UTC)
- C1 is not illegal in UTF-8. U+0085 (NEL / Next Line) is encoded as C2 85 in UTF8. I found this document which suggests that:
"NEL is the only C1 character recognized by Unicode"
- I don't know if that claim is true. But I tested a number of terminal emulators, and GNU Screen and Mosh were the only terminal emulators I tested that supported C2 85 as a newline character. --Hirsutism (talk) 21:07, 11 October 2012 (UTC)
- Screen isn't a terminal emulator; nor is mosh - they're applications which use terminals and rely upon those to provide a lot of the functionality associated with a terminal emulator. TEDickey (talk) 21:31, 11 October 2012 (UTC)
- Yes, Mosh does do terminal emulation. See here: "... the opportunity to build a clean UTF-8 terminal emulator from scratch ...". Mosh significantly reinterprets control characters and escape sequences, before sending them to the final terminal emulator. -Hirsutism (talk) 22:36, 11 October 2012 (UTC)
- I'm aware of the opinion of its developer(s), but since it relies on the terminal (and ncurses) for the functionality, it's like screen - a translator which isn't a complete terminal emulator. You're not likely to find an authoritative source which agrees with that opinion. TEDickey (talk) 22:56, 11 October 2012 (UTC)
- We're getting stuck in a side-tangent here. The precise definition of "terminal emulator" isn't important for this Wikipedia page. What matters here is: Putty + Mosh recognize NEL (encoded as C2 85) as a newline character. Even this empirical evidence is a side-tangent... the main discussion is about whether the Unicode spec fully recognizes NEL (or other C1 characters). --Hirsutism (talk) 15:28, 12 October 2012 (UTC)
- Sure. But your suggested source isn't what one might term authoritative, due to several simple errors. For example, in the paragraph following the one you're interested in, he states
Since VT100 (that uses C1 extensively)...
which is incorrect. Scanning quickly, I see other errors. If you're simply stating that you can find someone agreeing with your point, that's easily done of course (google is your friend). TEDickey (talk) 23:03, 12 October 2012 (UTC)
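The encoding point in this thread (raw bytes 0x80-0x9F versus the C1 code points U+0080-U+009F) can be checked directly. The following is a small Python sketch; the function names are invented for illustration, and it says nothing about how any particular terminal treats the decoded characters.

```python
def utf8_of_c1(code_point: int) -> bytes:
    """Encode a C1 control code point (U+0080..U+009F) as UTF-8."""
    if not 0x80 <= code_point <= 0x9F:
        raise ValueError("not a C1 control code point")
    return chr(code_point).encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Report whether `data` decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# NEL (U+0085) becomes the two-byte sequence C2 85.
nel_bytes = utf8_of_c1(0x85)

# A lone 0x85 byte is a stray continuation byte, so it is not valid UTF-8,
# but the code point itself is perfectly encodable.
lone_byte_ok = is_valid_utf8(b"\x85")
encoded_ok = is_valid_utf8(nel_bytes)
```

So both statements in the thread hold at different layers: an untranslated 8-bit C1 byte is invalid in a UTF-8 stream, while the C1 character encoded as C2 80 through C2 9F is legal.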
Octal
Would anyone object were we to add Octal to the table also? We already have decimal and hex. Maratrean (talk) 08:16, 29 October 2011 (UTC)
- Octal is wonderful, but hasn't its time passed? An extra column would be quite confusing, so why add it? There are probably lots of people who really have no interest in octal, so I think a good reason for adding it would be needed. Johnuniq (talk) 09:10, 29 October 2011 (UTC)
- I object too. Of course, octal is derived from hex (or decimal), so it would just be a dependent addition (derivable). Of course one can add: so is decimal - all right. Only, decimal is used directly nowadays (e.g. when entering by keyboard). Someone else could argue: hey, let's add UTF-8, UTF-16, and such. So I do object. -DePiep (talk) 22:14, 30 October 2011 (UTC)
The 'C' column includes many missing entries. In the language 'C' it is ordinary to use octal escape sequences to express and enter these missing entries. Why not fill out the missing entries in the C column in octal - such as '\003' - solves the OP, completes the column, and provides a reference to programmers wishing to use the control codes under discussion. — Preceding unsigned comment added by 92.21.236.161 (talk) 00:20, 5 February 2015 (UTC)
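As a rough illustration of the suggestion above, the octal form of a C character escape can be derived mechanically from the code value. The helper name `c_octal_escape` is made up for this sketch; it only formats the text of the escape and is not taken from the article's table.

```python
def c_octal_escape(code: int) -> str:
    """Return the C character-literal spelling of `code` as a 3-digit octal escape."""
    if not 0 <= code <= 0xFF:
        raise ValueError("expected a single byte value")
    return "'\\{:03o}'".format(code)


# ETX (0x03) and ESC (0x1B) as they would be written in C source.
etx_literal = c_octal_escape(0x03)
esc_literal = c_octal_escape(0x1B)
```

Python string literals accept the same octal escapes, so `"\033"` and `"\x1b"` denote the same character there as well.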
7F
7F is delete. Which control code operates this? Kg pwn (talk) 22:55, 14 June 2012 (UTC)
- In Unix, it's sometimes referred to as "Ctrl-?" or "^?"... AnonMoos (talk) 05:25, 15 June 2012 (UTC)
- Yeah, but is it like... C2... or something — Preceding unsigned comment added by Kg pwn (talk • contribs) 19:25, 1 August 2012 (UTC)
Neither - ECMA-35 / ISO-2022 make SPACE and DELETE special cases (not control characters, and not a member of C0/C1). The positions used for those in the 128-255 range are printable characters, by the way. TEDickey (talk) 23:55, 1 August 2012 (UTC)
Restructuration
I suggest restructuring this article as follows:
- Principles
- (why control codes)
- History
- (main dates)
- Interoperability
- Main standards interoperability issues
- utf-8, windows-1252, etc.
- Main protocols and applications
- terminal, file text, unix, videotext, etc
- Code assignations
- C0 set
- C1 set
- Example of sequence using control code — Preceding unsigned comment added by 84.97.14.22 (talk) 17:25, 19 July 2012 (UTC)
Various standards
http://www.itscj.ipsj.or.jp/ISO-IR/2-6.htm — Preceding unsigned comment added by 77.198.9.102 (talk) 23:21, 24 July 2012 (UTC)
^X links
These links are all circular, or point to articles about usage of shortcut combinations on Windows, which has nothing to do with control codes. I recommend reverting the addition of them. Spitzak (talk) 05:20, 21 September 2013 (UTC)
- I partially agree with your observation, but not with your conclusion.
- I deliberately put the links in because semantically there is a difference between a control character given in notation ^X (specifies a key combination with Ctrl, not a specific function - associated functions are operating system and application specific), a control character given in notation \x (specific formatting to some programming languages), named control characters distinguished by function (Linefeed, Tabulator, Bell, Null) or named control characters distinguished by code (NUL, ETX, etc.) in specific standards like ASCII etc.
- While not being circular, at present some of the links have the same target (which often does not reflect the above semantics correctly), but this is a problem of sub-optimal target linking in redirects rather than a problem of adding local links to the terms as is. We will have to retarget some redirects and restructure some articles to create semantically more correct link targets, but this won't happen overnight. However, we will create awareness for this "unevenness" only by starting to incorporate the links - over time, this will create a momentum which will help to shift the targets to be more semantically correct. If we don't add the links, neither the semantic differences nor the structure will become apparent to most users, so changes in this area would happen only randomly and without a clear direction rather than systematically following some overall structure.
- --Matthiaspaul (talk) 11:12, 21 September 2013 (UTC)
- The ^X notation actually indicates the character with the value of an ASCII 'X' xor'd with 0x40. Although often the same, it is not a symbol for the key sequence. For instance, ^@ means a character that is more likely produced by typing ctrl+space. In any case I think links leading to discussion of Windows shortcuts are wrong; these shortcuts are processed directly from keyboard input and at no point is a C0/C1 control code ever used. Spitzak (talk) 01:52, 29 May 2014 (UTC)
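The XOR rule described in the comment above can be written out as a short Python sketch. The function names are invented for illustration; the only assumption is the rule itself, that the character shown after the caret is the control code's value XOR 0x40.

```python
def caret_notation(code: int) -> str:
    """Return the ^X form of a C0 control code or DEL."""
    if not (0x00 <= code <= 0x1F or code == 0x7F):
        raise ValueError("caret notation covers 0x00-0x1F and 0x7F only")
    return "^" + chr(code ^ 0x40)


def from_caret(notation: str) -> int:
    """Return the code value denoted by a two-character ^X form."""
    if len(notation) != 2 or notation[0] != "^":
        raise ValueError("expected a two-character string such as '^X'")
    shown = ord(notation[1])
    # '?' (0x3F) maps to DEL; '@' (0x40) through '_' (0x5F) map to 0x00-0x1F.
    if not 0x3F <= shown <= 0x5F:
        raise ValueError("character after '^' must be in '?'..'_'")
    return shown ^ 0x40
```

The same rule explains the "^?" spelling for DEL mentioned in the 7F section: 0x7F XOR 0x40 is 0x3F, the question mark.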
Purpose
What this article doesn't really make clear is why C0 and C1 are in Unicode. The use of U+2400 ... U+243F is immediately obvious, and I guess it makes some sense to reserve NUL, TAB, CR and LF.
But what are you supposed to do when you encounter SI? Obviously you aren't meant to switch to a different character set, because if people wanted to encode a character not in Unicode they'd use a PUA character. Maybe it's part of a quoted string of bytes to send to some machine for which SI does make sense? No, because then you'd use the visual representation ␏.
If you find BEL, are you supposed to sound a bell? Of course not. A Unicode text is just that, text, not a string of instructions to do something. Even when displayed, it tends to be scrollable and no bell moment exists. And you wouldn't want to allow text to ring bells anyway. Again, for quoted bytes there's the visual representation.
What about SOH? Again, meaningless in text unless quoted. Most of these control codes are useless as part of text. Insofar as they make sense at all, it's as formatting, which isn't within the Unicode scope, but within things like HTML and CSS, or whatever format your word processor uses. The only reason it makes sense to reserve NUL, TAB, CR and LF is the sheer ubiquity of simple file formats (we call them text files, but they do contain formatting in addition to text) and memory representations of strings that need these.
So the question is, what is the purpose of the C0 and C1 control codes? — Preceding unsigned comment added by 82.139.81.0 (talk) 18:44, 28 May 2014 (UTC)
- They're in Unicode to preserve compatibility with ASCII etc. character sets. AnonMoos (talk) 03:36, 7 February 2015 (UTC)
- C1 comes from ISO-6429 (aka ECMA-48), and ISO-2022 (aka ECMA-35). It is not so much for compatibility (since the Unicode standard merely lists the names without attempting to describe functionality) as because ISO 10646 grew out of the standardization work for the older encodings. Because Unicode does not describe functionality, it does not standardize C0/C1, merely makes a few assumptions relying upon those other documents as the relevant standards. TEDickey (talk) 12:05, 7 February 2015 (UTC)
sources discussing smtp rather than ISO 10646
The given sources are discussing SMTP rather than ISO 10646 as such:
The following is a draft for an RFC updating SMTP to allow and encourage use of ISO 10646 (now DIS, of course).
and without a more suitable supplementary source, the statements do not match the source TEDickey (talk) 23:55, 7 April 2015 (UTC)
- If you read this paragraph:
- In Internet messages, the dynamic compaction method (compaction method 5) is used, the initial state being G=32, P=32, R=32, with each octet specifying a value of C. (Translated into normal English, that sentence means: "The text is in 8-bit Latin-1 until we get to the first HOP, if any!") Transitions to other character sets, represented by rows and, in some cases, planes, is done with a sequence that begins with the HOP ("High Octet Preset") code (decimal 129). The SGCI ("Single Graphic Character Introducer") is not used (i.e. we use "level 1" of method 5).
- It's pretty clear to me it is discussing how the ISO 10646 draft is applied to SMTP. It's not introducing HOP or SGCI itself, it is pulling them from the draft. It would be great if someone could find old ISO 10646 drafts and we could quote them instead, but even in the absence of copies of those old drafts, I don't think there is any other plausible interpretation of this paragraph. SJK (talk) 12:23, 9 April 2015 (UTC)
Without the said draft, you cannot distinguish the interpretation which you wish to make from an equally plausible one that refers to some ISO-2022 feature which is commented upon as not being in ISO 10646. As such, your commentary in the topic amounts to original research. As I said, you need a supplementary source to provide the information rather than interpreting TEDickey (talk) 00:43, 10 April 2015 (UTC)
Please see Ken Whistler, Formal Name Aliases for Control Characters, L2/11-281, Unicode Consortium, July 20, 2011, which explains the situation much better than my previous reference did:
Notes Regarding Omissions I have deliberately omitted three control code names and their abbreviations which occur in one (obsolete) RFC, but which are an artifact of early unapproved drafts of 10646. To wit: 0080 PADDING CHARACTER (PAD) 0081 HIGH OCTET PRESET (HOP) 0099 SINGLE GRAPHIC CHARACTER INTRODUCER (SGC) Those 3 were proposed (on spec) in early drafts of 10646, for what became a failed architectural direction for 10646. They would be completely forgotten now except for the persistent (and pernicious) RFC that lists them without indicating their failed status. Nobody has ever implemented them, so they are nothing more than character encoding curiosities.
So this reference justifies my inference as correct. I will replace my prior reference with this one. SJK (talk) 10:52, 10 April 2015 (UTC)
Missing information
These control codes had names in Unicode 1.0 but these names were later removed. The article should explain when and why.
10646-1 forbids the use of C1 controls, requiring an ESC FE sequence instead. The article should detail when and why this came about and whether or not it is still in force in Unicode. — Preceding unsigned comment added by 82.139.82.82 (talk) 03:22, 6 September 2015 (UTC)
- That (ESC Fe) was made obsolete a long time ago, and removed. See this for example. TEDickey (talk) 12:55, 6 September 2015 (UTC)
merge vs deletion
While it's interesting that Unicode has a subset of C0/C1 codes, deleting most of the content of this topic to replace it by a redirect to a summary paragraph should have some discussion involving the editors who've been maintaining the page. TEDickey (talk) 08:28, 4 August 2016 (UTC)
C1 control pictures
Why are there no C1 control pictures in the UCS? 1234qwer1234qwer4 (talk) 15:19, 2 June 2019 (UTC)
- For instance this? Likely disinterest on the part of the committee members who were not involved in software development TEDickey (talk) 16:25, 2 June 2019 (UTC)
- The Unicode Public General Mail List is probably a better place to ask this question. Google "c1 control pictures" site:unicode.org to see the discussions that have already taken place. If your question is "Why do C0 controls get pictures but not C1 controls?" then the short answer is compatibility with a legacy encoding that had C0 control pictures. DRMcCreedy (talk) 16:31, 2 June 2019 (UTC)
- Actually, asking on a mailing list can get mixed results. If I wanted to know, I'd ask Frank. Either way, unless someone points to a mail-archive discussing the relevant issues, the best you'd get would be a primary source (unsuitable for topic development). TEDickey (talk) 19:15, 2 June 2019 (UTC)
What does C0 and C1 mean? Where did it came from? Are there also C2, C3? or did these exist?
I'd like to see the article explain the origin of the terms "C0" and "C1" and answer all these questions. --RokerHRO (talk) 16:25, 14 April 2020 (UTC)
- See C0 and C1 control codes § C1 controls:
In 1973, ECMA-35 and ISO 2022[1] attempted to define a method so an 8-bit "extended ASCII" code could be converted to a corresponding 7-bit code, and vice versa.[2] In a 7-bit environment, the Shift Out (SO) would change the meaning of the 96 bytes 0x20 through 0x7F[a][4] (i.e. all but the C0 control codes), to be the characters that an 8-bit environment would print if it used the same code with the high bit set. This meant that the range 0x80 through 0x9F could not be printed in a 7-bit environment,[2] thus it was decided that no alternative character set could use them, and that these codes should be additional control codes, which become known as the C1 control codes. To allow a 7-bit environment to use these new controls, the sequences ESC @ through ESC _ were to be considered equivalent.[2] The later ISO 8859 standards abandoned support for 7-bit codes, but preserved this range of control characters.
- There are only C0 and C1, but ECMA-35/ISO 2022 allow selection of four graphic code sets, G0 through G3, with G0 being the ASCII graphic characters by default. -- 03:05, 29 May 2024 Guy Harris
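The equivalence described in the quoted passage (each C1 byte 0x80-0x9F corresponds to ESC followed by a byte from `@` to `_`) amounts to subtracting 0x40. A minimal Python sketch follows; the function names are invented here and the code only restates that arithmetic.

```python
ESC = 0x1B


def c1_to_7bit(byte: int) -> bytes:
    """Map an 8-bit C1 control (0x80..0x9F) to its two-byte 7-bit form, ESC @ .. ESC _."""
    if not 0x80 <= byte <= 0x9F:
        raise ValueError("not a C1 control byte")
    return bytes([ESC, byte - 0x40])


def c1_from_7bit(sequence: bytes) -> int:
    """Map a two-byte sequence ESC @ .. ESC _ back to the 8-bit C1 control."""
    if len(sequence) != 2 or sequence[0] != ESC or not 0x40 <= sequence[1] <= 0x5F:
        raise ValueError("not an ESC @ .. ESC _ sequence")
    return sequence[1] + 0x40
```

For example, String Terminator (0x9C), discussed earlier on this page, corresponds to ESC followed by a backslash.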
References
- ^ ECMA/TC 1 (1973). "Brief History". 7-bit Input/Output Coded Character Set (PDF) (4th ed.). ECMA. ECMA-6:1973.
- ^ a b c ECMA/TC 1 (1971). "8.2: Correspondence between the 7-bit Code and an 8-bit Code". Extension of the 7-bit Coded Character Set (PDF) (1st ed.). ECMA. pp. 21–24. ECMA-35:1971.
- ^ ECMA/TC 1 (1973). "4.2: Specific Control Characters". 7-bit Input/Output Coded Character Set (PDF) (4th ed.). ECMA. p. 16. ECMA-6:1973.
- ^ ECMA/TC 1 (1985). "5.3.8: Sets of 96 graphic characters". Code Extension Techniques (PDF) (4th ed.). ECMA. pp. 17–18. ECMA-35:1985.
JSON_streaming#Record_separator-delimited_JSON
I'd like to add a link to JSON streaming#Record separator-delimited JSON but I am unsure where it would fit best. --RokerHRO (talk) 22:40, 5 March 2021 (UTC)
- Perhaps in the rightmost column of the table in C0 and C1 control codes#Basic ASCII control codes - there's a big box for FS/GS/RS/US, mentioning various uses of those control characters. Guy Harris (talk) 22:59, 5 March 2021 (UTC)
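For context on what such a link would point to, here is a minimal Python sketch of record separator-delimited JSON. It assumes the framing used by RFC 7464 (each JSON text preceded by RS, 0x1E, and followed by LF); the function names are made up for this example.

```python
import json

RS = "\x1e"  # Record Separator, C0 control code 0x1E


def dump_records(records: list) -> str:
    """Serialize each record as RS + JSON text + LF."""
    return "".join(RS + json.dumps(record) + "\n" for record in records)


def load_records(stream: str) -> list:
    """Split on RS and parse each non-empty chunk as JSON."""
    return [json.loads(chunk) for chunk in stream.split(RS) if chunk.strip()]
```

Because RS cannot appear unescaped inside a JSON text, a reader can resynchronize at the next RS after a truncated or corrupted record.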
State machines
This text in C0 codes is certainly anachronistic and arguably simply wrong:
- This large number of codes was desirable at the time, as multi-byte controls would require implementation of a state machine in the terminal, which was very difficult with contemporary electronics and mechanical terminals
State machines per se were neither difficult nor expensive. Shift states were required for existing coding systems such as BAUDOT, and were significantly less complex than the shift registers already needed for sending and receiving serial communication.
A state machine that could interpret VT-100 style escape sequences however would have been prohibitive in 1964.
The prime reason for avoiding shift states (or state machines in general) was to cope better with unreliable transmission, though I don't have a citation for that.
To describe 32 as a "large number" is laughable compared with the hundreds of controls that are implemented as sequences of bytes by typical terminal emulators.
Bitwise interpretation of ASCII codes
(Maybe this table might be useful in an article, once we've figured out which article.)
bits | meaning |
---|---|
0000000, 1111111 | no action; ignored |
00_____ | controls |
__00___ | Transmission controls, affecting DCEs |
__01___ | layout controls, driving the motors in printers |
__10___ | Terminal controls, including shift states and device-specific functions |
__11___ | File format markers |
01_____ | Digits & punctuation |
1______ | Letters |
_0_____ | Upper-case |
_1_____ | Lower-case |
Although ASCII was designed as a coding system for transmission, unlike previous coding systems it could also function as an encoding for computation, with each printable character fitting into a single machine word ("byte", as we would know it today). This meant that there needed to be in excess of 64 codes, dictating a minimum of 7 bits.
As only around 80-90 graphic characters were envisaged, it would have seemed foolhardy to "skimp" on control codes; clearly at least 16 would be useful.
As there are broadly 4 classes of control codes, and a need for at least 5 transmission controls and 6 format controls, it made logical sense to reserve 4 groups of 8 codes, or 32 in all.
The eventual ASCII standard included codes that deviated from this simple arrangement, but this initial framework is still plain to see.
Martin Kealey (talk) 03:04, 13 August 2022 (UTC)
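The bit-pattern table in this comment can be turned into a small classifier. This Python sketch follows the table as written, not the ASCII standard itself (which, as the comment notes, deviates from the simple arrangement); the function name and the returned labels are chosen for illustration.

```python
CONTROL_GROUPS = [
    "transmission controls",   # __00___ : 0x00-0x07
    "layout controls",         # __01___ : 0x08-0x0F
    "terminal controls",       # __10___ : 0x10-0x17
    "file format markers",     # __11___ : 0x18-0x1F
]


def classify(code: int) -> str:
    """Classify a 7-bit code according to the bit patterns in the table above."""
    if not 0 <= code <= 0x7F:
        raise ValueError("expected a 7-bit code")
    if code in (0x00, 0x7F):          # all zeros or all ones
        return "no action; ignored"
    top_two = code >> 5               # the two most significant of the 7 bits
    if top_two == 0b00:
        return CONTROL_GROUPS[(code >> 3) & 0b11]
    if top_two == 0b01:
        return "digits & punctuation"
    if top_two == 0b10:
        return "letters (upper-case)"
    return "letters (lower-case)"
```

Running it over real ASCII shows where the final standard departs from the framework: ESC (0x1B), for instance, lands in the "file format markers" group, and punctuation such as `[` lands among the letters.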
Space is not a motion control character
It is a whitespace character. Which on computers is a normal character like a or z.
Moving right is a completely different action that does not create a character or change a text string. If it is not caught and handled by the input-handling program, then at best it is mapped to other characters and displayed in a safe way, and at worst, will mess up the terminal.
It seems, many younger people and computer illiterates misunderstand the space character in severe ways that are harmful to everyone, because they still think in terms of writing on paper.
Please do not spread that, and keep it to people who print out the Internet and are used by iDevices.
-- 2A02:3035:610:58B8:24DA:BC5F:806D:B752 (talk) 21:08, 28 May 2024 (UTC)
- I presume you're referring here to the entry in the first table, with SP described as "[Moving] right one character position."
- Many older people remember printing terminals, in which the space character moved the print head one position to the right and changed nothing on the paper. Those were the majority of terminals when ASCII was developed. Page 6 of the 1963 ASCII spec speaks of the character in the 0x20 position as "Word separator [space, normally non-printing]". It does, however, refer to it as a graphic character on page 11.
- The 1968 version also describes space as "normally non-printing", but puts it in the "Graphic Characters" section rather than the "Control Characters" section. It says that it is
A normally non-printing graphic character used to separate words. It is also a format effector which controls the movement of the printing position, one printing position forward. (Applicable also to display devices.)
- but does not specify in what fashion it's "applicable ... to display devices".
- Display terminals usually erased the character at the current display position, and moved one position to the right, when they received a space character, and most if not all terminal emulator programs emulate terminals of that sort. (The Datapoint 3300, however, appears to have supported both "space overwrite" and non-"space overwrite" behavior, perhaps because it was intended to be a replacement for ASCII Teletypes such as the Teletype Model 33, so, while it probably didn't support full overprinting, you could at least overwrite a space with another character.[1])
- The right thing to do would probably be to expand "Move right one character position." to something such as "Move right one character position; on display terminals, this usually erases the character at the current character position." Guy Harris (talk) 23:27, 28 May 2024 (UTC)
References
- ^ Datapoint 3300 / Instructions (PDF). p. 6.