International ASCII code table. ASCII (American Standard Code for Information Interchange): the basic text encoding for the Latin alphabet

03.08.2019

For a computer to work with data, the data must first be transformed into a form that allows convenient transfer, storage, or automatic processing. Various code tables are used for this purpose. ASCII was developed in the United States for working with English-language text and subsequently became widespread throughout the world. This article describes ASCII, its features and properties, and how it is used today.

Display and storage of information in a computer

Characters on a computer monitor or a mobile device are drawn from a set of vector shapes (glyphs), together with a code that identifies which glyph to display at a given position. That code is a series of bits. Each character must therefore correspond to its own unique sequence of zeros and ones.

How it all began

Historically, the first computers worked only with English text. Seven bits were enough to encode its characters, although a full byte of 8 bits was allocated for each one. This gave 128 characters that the computer could represent: the English alphabet, digits, punctuation marks, and some special characters. This seven-bit encoding and its table (code page), developed in 1963, was named the American Standard Code for Information Interchange. It is usually referred to by the abbreviation ASCII, which is still in use today.

Transition to multilingualism

Over time, computers became widely used in non-English-speaking countries as well, which created a need for encodings that supported national languages. Rather than reinvent the wheel, designers took ASCII as the basis and extended its table. Using the 8th bit made it possible to encode 256 characters.

Description

The extended ASCII table is divided into 2 parts. Only the first half (codes 0 to 127) is a generally accepted international standard. The table includes:

Characters with ordinal numbers from 0 to 31, encoded by sequences from 00000000 to 00011111. These are control characters, which control how text is output to the screen or printer, sound a signal, and so on.
Characters with ordinal numbers from 32 to 127, encoded by sequences from 00100000 to 01111111, which form the standard part of the table. These include the space (number 32), the uppercase and lowercase letters of the Latin alphabet, the ten digits 0 to 9, punctuation marks, brackets of different styles, and other symbols. The last code, 127, is the DEL control character.
Characters with ordinal numbers from 128 to 255, encoded by sequences from 10000000 to 11111111. These include letters of national alphabets other than Latin. It is this alternative part of the table that is used to encode Russian characters.
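
The three ranges described above can be sketched in a few lines of Python. This is only an illustration: the function name `ascii_range` and the labels it returns are assumptions made for this example and are not part of any standard.

```python
def ascii_range(code: int) -> str:
    """Return which part of the 8-bit table a code belongs to (labels are illustrative)."""
    if not 0 <= code <= 255:
        raise ValueError("code must fit in one byte (0-255)")
    if code <= 31:
        return "control"      # 00000000 - 00011111
    if code <= 127:
        return "standard"     # 00100000 - 01111111
    return "alternative"      # 10000000 - 11111111


for code in (10, 32, 65, 200):
    # Show the decimal number, its 8-bit binary form, and the range it falls in.
    print(code, format(code, "08b"), ascii_range(code))
```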

Some properties

One notable property of ASCII is that the uppercase and lowercase forms of each letter "A" to "Z" differ by only one bit. This greatly simplifies case conversion, as well as checking whether a character falls within the letter range. In addition, every letter is represented by its ordinal number in the alphabet, written as 5 binary digits and preceded by 011 for lowercase letters or 010 for uppercase ones (both prefixes are binary).
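
A minimal sketch of this property in Python follows. The helper names `toggle_case` and `alphabet_position` are hypothetical and exist only for this example; both expect a single character.

```python
CASE_BIT = 0b0010_0000  # the one bit that separates "A" (0100 0001) from "a" (0110 0001)


def is_ascii_letter(ch: str) -> bool:
    """Range check for a single character: A-Z or a-z."""
    return ("A" <= ch <= "Z") or ("a" <= ch <= "z")


def toggle_case(ch: str) -> str:
    """Flip the case of an ASCII letter; return any other character unchanged."""
    return chr(ord(ch) ^ CASE_BIT) if is_ascii_letter(ch) else ch


def alphabet_position(ch: str) -> int:
    """Ordinal number of a letter in the alphabet (A=1 ... Z=26), read from the low 5 bits."""
    if not is_ascii_letter(ch):
        raise ValueError("expected an ASCII letter")
    return ord(ch) & 0b0001_1111


print(toggle_case("a"), toggle_case("Z"), alphabet_position("e"))
```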

Another feature of ASCII is the representation of the 10 digits "0" to "9". In binary, their codes start with 0011 and end with the binary value of the digit. For example, binary 0101 is decimal five, so the character "5" is written as 0011 0101. This makes it easy to convert binary-coded decimal (BCD) numbers to an ASCII string: prepend 0011 to each nibble.
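
The BCD conversion can be sketched as follows, assuming packed BCD with two decimal digits per byte (the packing and the function name `bcd_to_ascii` are assumptions for this example).

```python
def bcd_to_ascii(packed: bytes) -> str:
    """Convert packed BCD (two decimal digits per byte) to a string of ASCII digits."""
    digits = []
    for byte in packed:
        # Split the byte into its high and low nibbles.
        for nibble in (byte >> 4, byte & 0x0F):
            if nibble > 9:
                raise ValueError("not a valid BCD digit")
            # Prepend 0011 to the nibble to obtain the ASCII code of the digit.
            digits.append(chr(0b0011_0000 | nibble))
    return "".join(digits)


print(bcd_to_ascii(bytes([0x19, 0x63])))  # prints 1963
```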

"Unicode"

Displaying texts in the languages of East and Southeast Asia requires thousands of characters. That many characters cannot be described by one byte, so even extended versions of ASCII could no longer meet the needs of users in different countries.

Thus, the need arose for a universal text encoding, which was developed by the Unicode Consortium in cooperation with many leaders of the global IT industry. One of its encoding forms is UTF-32, which allocates 32 bits (4 bytes) to each character. Its main drawback is that it requires 4 times as much memory as a single-byte encoding, which causes many problems.

At the same time, for most countries whose official languages belong to the Indo-European group, a code space of 2^32 characters is far more than is needed.

Further work by the Unicode Consortium produced the UTF-16 encoding. It offered a practical compromise between the amount of memory required and the number of characters that can be encoded, and it became a default choice in many systems. UTF-16 uses 2 bytes for most characters in common use and 4 bytes for the rest.

Even this rather advanced and successful form of Unicode has a drawback: converting a document from extended ASCII to UTF-16 doubles its size.

For this reason, the variable-length encoding UTF-8 came into use. In it, each character of the source text is encoded as a sequence of 1 to 4 bytes (the original design allowed up to 6).

Relationship with the American Standard Code for Information Interchange

In variable-length UTF-8, all basic Latin letters are encoded in 1 byte, exactly as in ASCII.

A distinctive feature of UTF-8 is that a Latin text containing no other characters can be read even by programs that do not understand Unicode. In other words, the basic part of ASCII carries over unchanged into the new variable-length UTF. Cyrillic characters occupy 2 bytes in UTF-8, and Georgian ones, for example, 3 bytes. The creation of UTF-16 and UTF-8 solved the main problem of establishing a single code space for fonts. Since then, font manufacturers have only needed to fill the table with vector shapes of the characters they require.
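
The byte counts mentioned above can be checked with Python's built-in UTF-8 codec. The sample characters and the helper name `utf8_length` are chosen only for illustration.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


samples = {
    "Latin A": "A",
    "Cyrillic Я": "Я",
    "Georgian ა": "ა",
}

for name, ch in samples.items():
    print(name, utf8_length(ch))

# Text made only of basic Latin characters is byte-for-byte identical in ASCII and UTF-8.
ascii_text = "plain Latin text"
print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))
```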

Different operating systems prefer different encodings. To read and edit text typed in another encoding, text conversion programs are used. Some text editors have built-in converters and can read text regardless of its encoding.

Now you know how many characters are in ASCII and how and why it was developed. Of course, the most widespread standard in the world today is Unicode. However, we must not forget that it was created on the basis of ASCII, so the contribution of ASCII's developers to the field of IT should be appreciated.

According to the International Telecommunication Union, three and a half billion people used the Internet with varying regularity in 2016. Most of them never think about the fact that any message they send from a PC or mobile device, and any text displayed on a screen, is actually a combination of 0s and 1s. This representation of information is called encoding. It makes storing, processing, and transmitting information possible and much easier. This article is devoted to the American ASCII encoding, developed in 1963.

Presentation of information in a computer

From the point of view of any electronic computer, text is a collection of individual characters. These include not only letters, including capital letters, but also punctuation marks and numbers. In addition, special characters "=", "&", "(" and spaces are used.

The set of symbols that make up the text is called the alphabet, and their number is called its cardinality (denoted N). It is given by the expression N = 2^b, where b is the number of bits, or informational weight, of a single character.

An alphabet of 256 characters is considered sufficient to represent all the characters needed for ordinary text.

Since 256 is the 8th power of two, the weight of each character is 8 bits.

The unit of measurement of 8 bits is called 1 byte, so it is customary to say that any character in a text stored on a computer takes up one byte of memory.
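
The relationship N = 2^b can be tried out with a short sketch. The function names here are hypothetical; `bits_per_character` returns the smallest b for which 2^b is at least N.

```python
def alphabet_size(bits: int) -> int:
    """Cardinality N of an alphabet whose characters weigh `bits` bits each."""
    return 2 ** bits


def bits_per_character(size: int) -> int:
    """Smallest number of bits b such that 2 ** b >= size."""
    if size < 2:
        raise ValueError("an alphabet needs at least two characters")
    return (size - 1).bit_length()


print(alphabet_size(8))         # 256 characters fit in 8 bits
print(bits_per_character(256))  # 8 bits, i.e. 1 byte per character
print(bits_per_character(128))  # 7 bits, as in the original ASCII
```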

How is coding done

Text is entered into the memory of a personal computer from a keyboard whose keys are marked with digits, letters, punctuation marks, and other symbols. The characters are transferred to RAM in binary code: each character is associated with a decimal code familiar to humans, from 0 to 255, which corresponds to a binary code from 00000000 to 11111111.

Byte-based character encoding lets a text-processing program access each character separately. At the same time, 256 characters are enough to represent a wide variety of character information.

ASCII character encoding

In English, this abbreviation stands for American Standard Code for Information Interchange.

Even at the dawn of computerization, it became obvious that you can come up with a wide variety of ways to encode information. However, to transfer information from one computer to another, it was required to develop a single standard. So, in 1963, an ASCII encoding table appeared in the United States. In it, any symbol of the computer alphabet is associated with its ordinal number in binary representation. Initially, ASCII was used only in the United States and later became the international standard for PCs.

ASCII codes are divided into 2 parts. Only the first half of this table is considered an International Standard. It includes characters with ordinal numbers from 0 (encoded as 00000000) to 127 (code 01111111).

Serial number	ASCII text encoding	Symbol
0 - 31	0000 0000 - 0001 1111	Characters with N from 0 to 31 are called control characters. Their function is to control the process of displaying text on a monitor or printing device, giving a sound signal, etc.
32 - 127	0010 0000 - 0111 1111	Characters with N from 32 to 127 (standard part of the table): upper and lower case letters of the Latin alphabet, 10 digits, punctuation marks, as well as various brackets, commercial and other symbols. Character 32 denotes a space.
128 - 255	1000 0000 - 1111 1111	Characters with N from 128 to 255 (alternative part of the table, or code page) can have different variants, each of which has its own number. The code page is used to specify national alphabets other than Latin. In particular, it is the code page that is used to encode Russian characters.

In the encoding table, uppercase and lowercase letters follow one another in alphabetical order, and digits follow in ascending order of value. This principle also applies to the Russian alphabet.

Control characters

The ASCII table was originally created for sending and receiving information on the teletype, a device that has long been out of use. For this reason the character set includes non-printable characters that served as commands to control the device. Similar commands were used in pre-computer messaging methods such as Morse code.

The best-known "teletype" character is NUL (00, "null"). It is still used today in C and many languages derived from it, where it marks the end of a string.
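
As a rough illustration of NUL acting as a terminator, the sketch below reads a C-style string out of a byte buffer. The function name `c_string` is an assumption for this example; Python itself does not terminate its strings with NUL.

```python
def c_string(buffer: bytes) -> str:
    """Decode the ASCII text that precedes the first NUL byte (code 0) in a buffer."""
    end = buffer.find(b"\x00")
    if end == -1:
        end = len(buffer)  # no terminator found: take the whole buffer
    return buffer[:end].decode("ascii")


print(c_string(b"teletype\x00leftover bytes"))  # prints teletype
```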

Where is ASCII encoding used?

ASCII is needed for more than just entering text from the keyboard. It is also used in graphics. Specifically, ASCII art programs such as ASCII Art Maker represent images in various file formats as arrangements of ASCII characters.

Such products are of two types: those that act as graphic editors working with text, and those that convert "pictures" into ASCII graphics. The familiar emoticon is a simple example of a picture built from ASCII characters.

ASCII can also be used when creating an HTML document. You can enter a character's numeric code in a special notation, and when the page is viewed, the character that corresponds to this code appears on the screen.
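
In HTML this notation is the numeric character reference, written as `&#` followed by the decimal code and `;`. A small sketch using Python's standard `html` module shows the round trip; the helper name `to_numeric_reference` is hypothetical.

```python
import html


def to_numeric_reference(ch: str) -> str:
    """Write a character as an HTML numeric character reference."""
    return f"&#{ord(ch)};"


print(to_numeric_reference("A"))                    # prints &#65;
print(html.unescape("&#65;&#83;&#67;&#73;&#73;"))   # prints ASCII
```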

ASCII is also needed for multilingual sites, since characters that are not included in a particular national table can be replaced by their numeric codes written in ASCII.

Some features

ASCII originally used 7 bits to encode text (the eighth bit of the byte was left unused), but today it is normally stored as 8-bit.

Uppercase and lowercase letters differ from each other by only a single bit. This greatly simplifies case checks and conversion.

Using ASCII in Microsoft Office

If necessary, this type of text encoding can be used in Microsoft text editors such as Notepad and Office Word. However, some functions are unavailable when text is saved this way. For example, you cannot make text bold, because ASCII preserves only the content of the information and ignores its appearance and formatting.

Standardization

ISO has adopted the ISO 8859 family of standards, which defines eight-bit encodings for different language groups. Specifically, ISO 8859-1 is an extended ASCII table for the United States and Western Europe, and ISO 8859-5 is a table for the Cyrillic alphabet, including Russian.

For a number of historical reasons, ISO 8859-5 never came into wide use.

In practice, the following encodings are used for the Russian language:

CP866 (Code Page 866), or the DOS encoding, often referred to as the alternative GOST encoding. It was actively used until the mid-1990s and is now practically out of use.
KOI-8. Developed in the 1970s and 1980s, it is a generally accepted standard for mail messages on the Runet and is widely used in operating systems of the Unix family, including Linux. The Russian version of KOI-8 is called KOI8-R. There are also versions for other Cyrillic languages, such as Ukrainian.
Code Page 1251 (CP1251, Windows-1251). Developed by Microsoft to support the Russian language in Windows.

The main advantage of CP866, the first of these standards, was that it kept pseudographic characters in the same positions as in Extended ASCII. This made it possible to run foreign-made text-mode programs, such as the well-known Norton Commander, without changes. CP866 is still used by Windows programs that work in full-screen text mode or in text windows, including FAR Manager.

Texts written in CP866 have become quite rare, but it is precisely this encoding that is used for Russian file names in Windows.

"Unicode"

Unicode is now the most widely used encoding. Its codes are divided into areas. The first (U+0000 to U+007F) contains the ASCII characters with the same codes. It is followed by areas for the characters of various national scripts, as well as punctuation marks and technical symbols. In addition, some Unicode codes are reserved in case new characters need to be added in the future.

Now you know that in ASCII each character is stored as a combination of 8 zeros and ones. To non-specialists this information may seem unnecessary and uninteresting, but don't you want to know what is happening "in the brains" of your PC?

The set of characters with which text is written is called the alphabet.

The number of characters in the alphabet is its cardinality.

Formula for determining the amount of information: N = 2^b,

where N is the cardinality of the alphabet (number of characters),

b is the number of bits (informational weight of a character).

An alphabet of 256 characters can accommodate almost all the necessary characters. Such an alphabet is called sufficient.

Because 256 = 2^8, the weight of 1 character is 8 bits.

The 8-bit unit was named 1 byte:

1 byte = 8 bits.

The binary code of each character in computer text takes up 1 byte of memory.

How is text information represented in the computer memory?

The convenience of byte encoding of characters is obvious, since a byte is the smallest addressable part of memory and, therefore, the processor can access each character separately, performing text processing. On the other hand, 256 characters is quite a sufficient number to represent a wide variety of character information.

The question now is which eight-bit binary code to associate with each character.

Clearly this is a matter of convention, and many encoding methods can be devised.

All characters of the computer alphabet are numbered from 0 to 255. Each number corresponds to an eight-digit binary code from 00000000 to 11111111. This code is simply the ordinal number of the character in the binary system.

The table in which all the characters of the computer alphabet are assigned serial numbers is called the encoding table.

Different coding tables are used for different types of computers.

The international standard for the PC became the ASCII table (pronounced "ask-ee"; American Standard Code for Information Interchange).

The ASCII table is divided into two parts.

Only the first half of the table is the international standard, i.e. characters with numbers from 0 (00000000) to 127 (01111111).

ASCII encoding table structure

Serial number	The code	Symbol
0 - 31	00000000 - 00011111	Symbols with numbers from 0 to 31 are usually called control characters. Their function is to control the process of displaying text on the screen or printing, giving a sound signal, marking text, etc.
32 - 127	00100000 - 01111111	Standard part of the table (English). This includes lowercase and uppercase letters of the Latin alphabet, decimal digits, punctuation marks, all kinds of brackets, commercial and other symbols. Character 32 is a space, i.e. an empty position in the text. All other codes correspond to visible characters.
128 - 255	10000000 - 11111111	Alternative part of the table (Russian). The second half of the ASCII code table, called the code page (128 codes, starting from 10000000 and ending with 11111111), can have different variants, each variant has its own number. The code page is primarily used to accommodate national alphabets other than Latin. In Russian national encodings, this part of the table contains symbols of the Russian alphabet.

The first half of the ASCII table

Note that in the encoding table, letters (uppercase and lowercase) are arranged in alphabetical order, and digits in ascending order of value. This observance of lexicographic order in the arrangement of characters is called the principle of sequential coding of the alphabet.

For the letters of the Russian alphabet, the principle of sequential coding is also observed.
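
The principle of sequential coding can be checked with a short sketch. It assumes Python's built-in `ascii` and `cp1251` codecs, and the helper name `is_sequential` is made up for this example. The Russian test string is the run of lowercase letters from "а" to "я" as ordered in Unicode, which omits "ё".

```python
def is_sequential(chars: str, encoding: str) -> bool:
    """True if each character's one-byte code is exactly one more than the previous one."""
    codes = [ch.encode(encoding)[0] for ch in chars]
    return all(later - earlier == 1 for earlier, later in zip(codes, codes[1:]))


latin_upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
digits = "0123456789"
# Lowercase Russian letters from "а" to "я", generated from their Unicode code points.
russian_lower = "".join(chr(c) for c in range(ord("а"), ord("я") + 1))

print(is_sequential(latin_upper, "ascii"))
print(is_sequential(digits, "ascii"))
print(is_sequential(russian_lower, "cp1251"))
```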

The second half of the ASCII table

Unfortunately, there are currently five different Cyrillic encodings (KOI8-R, Windows, MS-DOS, Macintosh, and ISO). Because of this, problems often arise when transferring Russian text from one computer to another, or from one software system to another.

Chronologically, one of the first standards for encoding Russian letters on computers was KOI8 ("Information exchange code, 8-bit"). This encoding was used back in the 1970s on computers of the ES series, and from the mid-1980s it was used in the first Russified versions of the UNIX operating system.

The CP866 encoding ("CP" stands for "Code Page") dates from the early 1990s, when the MS-DOS operating system was dominant.

Apple computers running Mac OS use their own Mac encoding.

In addition, the International Organization for Standardization (ISO) approved another encoding, ISO 8859-5, as a standard for the Russian language.

Currently, the most common encoding is the Microsoft Windows one, abbreviated as CP1251.

Since the late 1990s, the problem of standardizing character encoding has been addressed by a new international standard called Unicode. In its original form it is a 16-bit encoding, i.e. it allocates 2 bytes of memory for each character. Of course, this doubles the amount of memory used. On the other hand, such a code table can include up to 65536 characters, and later versions of the standard extended the code space further. The complete specification of the Unicode standard covers the existing, extinct, and artificially created alphabets of the world, as well as many mathematical, musical, chemical, and other symbols.

Let's try to use an ASCII table to imagine how words will look in computer memory.

Internal representation of words in computer memory
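
As a minimal sketch of this idea, the following Python snippet prints the binary code stored for each character of an ASCII word. The sample word and the function name `memory_image` are chosen only for illustration.

```python
def memory_image(word: str) -> list[str]:
    """Return the 8-bit binary code of every character of an ASCII word."""
    return [format(byte, "08b") for byte in word.encode("ascii")]


for ch, bits in zip("file", memory_image("file")):
    # Character, its decimal code, and the byte stored in memory.
    print(ch, ord(ch), bits)
```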

Sometimes a text consisting of letters of the Russian alphabet, received from another computer, cannot be read: some kind of "gibberish" appears on the monitor screen. This happens because computers use different encodings for the characters of the Russian language.

Overlay symbols

The BS (backspace) character lets a printer print one character on top of another. ASCII provided for adding diacritics to letters in this way, for example:

a BS ' → á
a BS ` → à
a BS ^ → â
o BS / → ø
c BS , → ç
n BS ~ → ñ

Note: in old fonts, the apostrophe ' was drawn with a slant to the left and the tilde ~ was shifted upward, so they suited the roles of the acute accent and the tilde above a letter.

If a character is overprinted with the same character, the result looks bold, and if it is overprinted with an underscore, the result is underlined text.

a BS a → a (bold)
a BS _ → a (underlined)

Note: this technique is used, for example, in the man help system.
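
The overstrike convention can be sketched in Python as follows. The helper names are hypothetical, and the order "underscore, BS, character" for underlining is an assumption of this example. `strip_overstrike` keeps whichever character was printed last at each position, which recovers the plain text from both forms built here.

```python
import re

BS = "\x08"  # the ASCII backspace character, code 8


def bold(text: str) -> str:
    """Overprint every character with itself."""
    return "".join(ch + BS + ch for ch in text)


def underline(text: str) -> str:
    """Print an underscore, back up, then print the character over it."""
    return "".join("_" + BS + ch for ch in text)


def strip_overstrike(text: str) -> str:
    """Drop every 'character + BS' pair, keeping the last character printed at each position."""
    return re.sub(".\x08", "", text)


print(repr(bold("NAME")))
print(strip_overstrike(bold("NAME")), strip_overstrike(underline("file")))
```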

National ASCII variants

The ISO 646 (ECMA-6) standard allows national characters to be placed in the positions of @ [ \ ] ^ ` { | } ~. In addition, £ can be placed in the position of # and ¤ in the position of $. This system is well suited to European languages, where only a few extra characters are needed. The ASCII version without national characters is called US-ASCII, or the "International Reference Version".

Subsequently, it turned out to be more convenient to use 8-bit encodings (code pages), where the lower half of the code table (0-127) is occupied by US-ASCII characters, and the upper half (128-255) is occupied by additional characters, including a set of national characters. Thus, the upper half of the ASCII table, before the widespread adoption of Unicode, was actively used to represent localized characters, letters of the local language. The lack of a unified standard for placing Cyrillic characters in the ASCII table caused many problems with encodings (KOI-8, Windows-1251, and others). Other languages with a non-Latin script also suffered from the presence of several different encodings.

	.0	.1	.2	.3	.4	.5	.6	.7	.8	.9	.A	.B	.C	.D	.E	.F
0.	NUL	SOM	EOA	EOM	EOT	WRU	RU	BELL	BKSP	HT	LF	VT	FF	CR	SO	SI
1.	DC 0	DC 1	DC 2	DC 3	DC 4	ERR	SYNC	LEM	S 0	S 1	S 2	S 3	S 4	S 5	S 6	S 7
2.
3.
4.	BLANK	!	"	#	$	%	&	'	(	)	*	+	,	-	.	/
5.	0	1	2	3	4	5	6	7	8	9	:	;	<	=	>	?
6.
7.
8.
9.
A.	@	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O
B.	P	Q	R	S	T	U	V	W	X	Y	Z	[	\	]		←
C.
D.
E.		a	b	c	d	e	f	g	h	i	j	k	l	m	n	o
F.	p	q	r	s	t	u	v	w	x	y	z				ESC	DEL

On those computers where the minimum addressable unit of memory was a 36-bit word, 6-bit characters were initially used (1 word = 6 characters). After the transition to ASCII on such computers, they began to place either 5 seven-bit characters in one word (1 bit remained superfluous), or 4 nine-bit characters.
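
The first of these layouts, five 7-bit characters in a 36-bit word, can be sketched as below. The function names are hypothetical, and placing the spare bit at the low-order end of the word is an assumption of this example, since the text does not say where the extra bit sat.

```python
def pack_36bit(chars: str) -> int:
    """Pack exactly five 7-bit ASCII characters into one 36-bit word."""
    if len(chars) != 5:
        raise ValueError("a word holds exactly five characters")
    word = 0
    for ch in chars:
        code = ord(ch)
        if code > 0x7F:
            raise ValueError("not a 7-bit ASCII character")
        word = (word << 7) | code
    # 5 * 7 = 35 bits are used; the lowest bit is left as the spare one (assumed position).
    return word << 1


def unpack_36bit(word: int) -> str:
    """Recover the five characters from a word produced by pack_36bit."""
    word >>= 1  # discard the spare bit
    return "".join(chr((word >> shift) & 0x7F) for shift in range(28, -1, -7))


word = pack_36bit("HELLO")
print(format(word, "036b"), unpack_36bit(word))
```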

ASCII codes are also used in programming to identify which key has been pressed.

8-bit encodings: ASCII, KOI8-R and CP1251

The first character set tables created in the USA did not use the eighth bit of a byte. Text was represented as a sequence of bytes, but the eighth bit was not taken into account (it was used for service purposes).

The generally accepted standard became the ASCII table (American Standard Code for Information Interchange). The first 32 ASCII characters (00 to 1F) were used for non-printable characters, designed to control a printing device and the like. The rest, from 20 to 7F, are ordinary printable characters (apart from the last one, DEL).

Table 1 - ASCII encoding

Dec	Hex	Oct	Char	Description
0	0	000		null
1	1	001		start of heading
2	2	002		start of text
3	3	003		end of text
4	4	004		end of transmission
5	5	005		inquiry
6	6	006		acknowledge
7	7	007		bell
8	8	010		backspace
9	9	011		horizontal tab
10	A	012		new line
11	B	013		vertical tab
12	C	014		new page
13	D	015		carriage return
14	E	016		shift out
15	F	017		shift in
16	10	020		data link escape
17	11	021		device control 1
18	12	022		device control 2
19	13	023		device control 3
20	14	024		device control 4
21	15	025		negative acknowledge
22	16	026		synchronous idle
23	17	027		end of trans. block
24	18	030		cancel
25	19	031		end of medium
26	1A	032		substitute
27	1B	033		escape
28	1C	034		file separator
29	1D	035		group separator
30	1E	036		record separator
31	1F	037		unit separator
32	20	040		space
33	21	041	!
34	22	042	"
35	23	043	#
36	24	044	$
37	25	045	%
38	26	046	&
39	27	047	'
40	28	050	(
41	29	051	)
42	2A	052	*
43	2B	053	+
44	2C	054	,
45	2D	055	-
46	2E	056	.
47	2F	057	/
48	30	060	0
49	31	061	1
50	32	062	2
51	33	063	3
52	34	064	4
53	35	065	5
54	36	066	6
55	37	067	7
56	38	070	8
57	39	071	9
58	3A	072	:
59	3B	073	;
60	3C	074	<
61	3D	075	=
62	3E	076	>
63	3F	077	?

Dec	Hex	Oct	Char
64	40	100	@
65	41	101	A
66	42	102	B
67	43	103	C
68	44	104	D
69	45	105	E
70	46	106	F
71	47	107	G
72	48	110	H
73	49	111	I
74	4A	112	J
75	4B	113	K
76	4C	114	L
77	4D	115	M
78	4E	116	N
79	4F	117	O
80	50	120	P
81	51	121	Q
82	52	122	R
83	53	123	S
84	54	124	T
85	55	125	U
86	56	126	V
87	57	127	W
88	58	130	X
89	59	131	Y
90	5A	132	Z
91	5B	133	[
92	5C	134	\
93	5D	135	]
94	5E	136	^
95	5F	137	_
96	60	140	`
97	61	141	a
98	62	142	b
99	63	143	c
100	64	144	d
101	65	145	e
102	66	146	f
103	67	147	g
104	68	150	h
105	69	151	i
106	6A	152	j
107	6B	153	k
108	6C	154	l
109	6D	155	m
110	6E	156	n
111	6F	157	o
112	70	160	p
113	71	161	q
114	72	162	r
115	73	163	s
116	74	164	t
117	75	165	u
118	76	166	v
119	77	167	w
120	78	170	x
121	79	171	y
122	7A	172	z
123	7B	173	{
124	7C	174	|
125	7D	175	}
126	7E	176	~
127	7F	177	DEL

As you can easily see, this encoding contains only those Latin letters that are used in English. There are also arithmetic and other service symbols. But there are no Russian letters, or even the special Latin letters used in German or French. This is easy to explain: the encoding was developed specifically as an American standard. When computers began to be used all over the world, it became necessary to encode other symbols.

For this purpose, the eighth bit of each byte was put to use. This made 128 more values available (from 80 to FF) for encoding characters. The first of the eight-bit tables, "extended ASCII" (Extended ASCII), included variants of Latin characters used in some Western European languages. It also contained other additional symbols, including pseudo-graphics.

Pseudo-graphic characters make it possible to produce some semblance of graphics while displaying only text characters. For example, the file manager FAR Manager draws its interface with pseudo-graphics.

There were no Russian letters in the Extended ASCII table. In Russia (formerly the USSR) and in other states, their own encodings were created, which made it possible to represent specific “national” characters in 8-bit text files - Latin letters of the Polish and Czech languages, Cyrillic (including Russian letters) and other alphabets.

In all encodings that became widespread, the first 128 characters (that is, the byte values whose eighth bit is 0) coincide with ASCII. Thus, an ASCII file reads correctly in any of these encodings; the letters of the English language are represented in the same way.
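
This compatibility can be checked against the codecs that ship with Python. The list of codec names below is a selection made for this example, and `decodes_identically` is a hypothetical helper.

```python
ENCODINGS = ("ascii", "cp866", "koi8_r", "cp1251", "iso8859_5", "latin_1")


def decodes_identically(data: bytes) -> bool:
    """True if 7-bit data yields the same text in every encoding listed above."""
    return len({data.decode(encoding) for encoding in ENCODINGS}) == 1


print(decodes_identically(b"Hello, world!"))
print(decodes_identically(bytes(range(128))))  # every code with the eighth bit equal to 0
```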

ISO (the International Organization for Standardization) adopted the ISO 8859 group of standards, which defines 8-bit encodings for different groups of languages. ISO 8859-1 is Extended ASCII, a table for the United States and Western Europe, and ISO 8859-5 is a table for Cyrillic (including Russian).

However, for historical reasons, the ISO 8859-5 encoding has not caught on. In reality, the following encodings are used for the Russian language:

- Code Page 866 (CP866), also known as "DOS" or the "alternative GOST encoding". It was widely used until the mid-1990s and is now used only to a limited extent. It is practically never used for distributing texts on the Internet.
- KOI-8. Developed in the 1970s and 1980s. It is a generally accepted standard for the transmission of mail messages on the Russian Internet. It is also widely used in operating systems of the Unix family, including Linux. The KOI-8 version designed for the Russian language is called KOI8-R; there are versions for other Cyrillic languages (for example, KOI8-U for Ukrainian).
- Code Page 1251 (CP1251, Windows-1251). Developed by Microsoft to support the Russian language in Windows.

The main advantage of the CP866 was the preservation of pseudo-graphic characters in the same places as in Extended ASCII; therefore, foreign text programs, for example, the famous Norton Commander, could work without changes. Nowadays CP866 is used for Windows programs running in text windows or in full screen text mode, including FAR Manager.

In recent years, texts in CP866 have become rather rare (although it is used to encode Russian file names in Windows). Therefore, we will dwell in more detail on the two other encodings, KOI8-R and CP1251.

In the CP1251 encoding table, Russian letters are arranged in alphabetical order (except, however, for the letter Ё). This arrangement makes it very easy for computer programs to sort alphabetically.
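
The effect of this exception can be seen by sorting words by their raw CP1251 bytes, as a naive program might. The word list and the function name `sort_by_code` are illustrative only. In CP1251 the letter "ё" has a smaller code than "а", so it sorts before the rest of the alphabet instead of after "е".

```python
def sort_by_code(words, encoding="cp1251"):
    """Sort words by comparing their encoded bytes, without any language rules."""
    return sorted(words, key=lambda word: word.encode(encoding))


words = ["яблоко", "ёж", "арбуз", "ель"]
print(sort_by_code(words))  # "ёж" comes first, although alphabetically it follows "ель"
```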

In KOI8-R, by contrast, the order of Russian letters seems random. In fact it is not.

Many older programs lost the 8th bit when processing or transmitting text. (Such programs have now practically "died out", but in the late 1980s and early 1990s they were widespread.) To get a 7-bit value from an 8-bit value, subtract 8 from the most significant hexadecimal digit; for example, E1 becomes 61.

Now compare KOI8-R with the ASCII table (Table 1). You will find that the Russian letters are deliberately aligned with the Latin ones that sound similar. If the eighth bit disappears, lowercase Russian letters turn into uppercase Latin letters, and uppercase Russian letters turn into lowercase Latin ones. So, E1 in KOI-8 is the Russian “А”, while 61 in ASCII is the Latin “a”.

So, KOI-8 keeps Russian text readable even when the 8th bit is lost. “Привет всем” (“Hello everyone”) becomes “pRIWET WSEM”.
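
Losing the eighth bit can be simulated with Python's `koi8_r` codec. The function name `drop_eighth_bit` is an assumption for this sketch.

```python
def drop_eighth_bit(text: str) -> str:
    """Encode text in KOI8-R, clear the high bit of every byte, and read the result as ASCII."""
    stripped = bytes(byte & 0x7F for byte in text.encode("koi8_r"))
    return stripped.decode("ascii")


print(drop_eighth_bit("Привет всем"))  # prints pRIWET WSEM
```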

Recently, both the alphabetical order of characters in the encoding table and readability after loss of the 8th bit have lost their decisive importance. The eighth bit is not lost on modern computers during either transmission or processing. Alphabetical sorting takes the encoding into account instead of simply comparing codes. (By the way, the CP1251 codes are not completely alphabetical: the letter Ё is out of place.)

Because there are two common encodings, when working with the Internet (mail, browsing websites) you can sometimes see a meaningless set of letters instead of Russian text. For example, “я СБЮФЕМХЕЛ”. These are just the words “С уважением” (“with respect”), but they were encoded in CP1251 and the computer decoded the text using the KOI-8 table. If the same words are instead encoded in KOI-8 and decoded using the CP1251 table, the result is “у ХЧБЦЕОЙЕН”.
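
Both mix-ups can be reproduced with Python's `cp1251` and `koi8_r` codecs. The function name `misread` is hypothetical. Applying the function with the two encodings swapped undoes the damage in this case, which is essentially what choosing the right encoding by hand achieves.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one table and decode the resulting bytes with another."""
    return text.encode(written_as).decode(read_as)


original = "С уважением"
garbled = misread(original, "cp1251", "koi8_r")
print(garbled)
print(misread(original, "koi8_r", "cp1251"))

# Reversing the two steps restores the original text.
print(misread(garbled, "koi8_r", "cp1251") == original)
```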

Sometimes the computer decodes Russian-language text using a table that is not intended for the Russian language at all. Then, instead of Russian letters, a meaningless set of symbols appears (for example, Latin letters of Eastern European languages); these are often called "krokozyabry".

In most cases, modern programs determine the encoding of Internet documents (emails and web pages) on their own. But sometimes they "misfire", and then you see strange sequences of Russian letters or "krokozyabry". As a rule, to display the real text on the screen, it is enough to select the encoding manually in the program menu.

For the article, the information from the page http://open-office.edusite.ru/TextProcessor/p5aa1.html was used.
