
How many characters are in the Unicode table? Unicode on the Web: An Introduction for Beginners

Unicode is a character encoding standard. It makes it possible to encode the texts of almost all written languages.

In the late 1980s, the role of the standard was played by 8-bit characters. 8-bit encodings existed in numerous variations, and their number kept growing, mainly as a result of the steadily expanding range of supported languages. There was also a desire among developers to devise an encoding that could claim at least partial universality.

As a result, it became necessary to deal with several problems:

  • displaying documents in the wrong encoding, which could be solved either by consistently introducing methods for specifying the encoding used, or by adopting a single encoding for everyone;
  • the limited size of the character set, solved either by switching fonts within a document or by introducing an extended encoding;
  • converting from one encoding to another, which seemed solvable either via an intermediate (third) encoding containing the characters of the different encodings, or by compiling conversion tables for every pair of encodings;
  • duplication of fonts. Traditionally, each encoding was assumed to have its own font, even when the encodings matched completely or partially in their character sets. To some extent the problem was solved with "large" fonts, from which the characters needed for a particular encoding were then selected. But determining the degree of overlap required creating a single registry of characters.

Thus, the creation of a single "wide" encoding was put on the agenda. Variable-length encodings, as used in South-East Asia, seemed too difficult to work with, so the emphasis was placed on characters of fixed width. 32-bit characters seemed too cumbersome, and 16-bit characters won in the end.

The standard was proposed to the Internet community in 1991 by the non-profit Unicode Consortium. Its use makes it possible to encode a very large number of characters from different writing systems: Chinese characters, mathematical symbols, Cyrillic and Latin can all sit side by side in a Unicode document, and no code-page switching is required while working with it.

The standard consists of two main parts: the universal character set (UCS) and the family of encodings (UTF, Unicode Transformation Format). The universal character set defines a one-to-one correspondence between characters and codes - elements of the code space, which are non-negative integers. The encoding family defines the machine representation of a sequence of UCS codes.

In the Unicode standard, the codes are divided into several areas. The area from U+0000 to U+007F contains the characters of the ASCII set with the same codes. Next come the areas for the characters of various scripts, technical symbols and punctuation marks. A separate batch of codes is held in reserve for future use. The following code areas are defined for Cyrillic: U+0400 - U+052F, U+2DE0 - U+2DFF, U+A640 - U+A69F.
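As a quick check (a minimal Python 3 sketch using only the built-in ord() and chr(); the ranges are the ones listed above):

```python
# ord() returns the Unicode code of a character; chr() goes the other way.
for ch in "АяЁ":
    print(f"{ch} -> U+{ord(ch):04X}")
# А -> U+0410, я -> U+044F, Ё -> U+0401 - all inside the basic
# Cyrillic area U+0400..U+04FF mentioned above.

print(chr(0x0416))  # Ж
```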

The importance of this encoding in the web space keeps growing: at the beginning of 2010 the share of sites using Unicode was already almost 50 percent.

Today we will talk about where mojibake (garbled characters) comes from on sites and in programs, what text encodings exist and which of them should be used. Let's take a closer look at the history of their development, starting from basic ASCII and its extended versions CP866, KOI8-R and Windows-1251, and ending with the Unicode Consortium's modern encodings UTF-16 and UTF-8. Contents:

  • Extended versions of ASCII - the CP866 and KOI8-R encodings
  • Windows-1251 - a variation of ASCII and why mojibake crawls out
To some this information may seem redundant, but you would not believe how many questions I get specifically about mojibake (unreadable sets of characters) crawling out. Now I will be able to refer everyone to the text of this article and let them find their own mistakes. Well, get ready to absorb the information and try to follow the thread of the story.

ASCII - the basic text encoding for the Latin script

The development of text encodings went hand in hand with the formation of the IT industry, and over that time they managed to change quite a lot. Historically it all started with EBCDIC (rather dissonant when pronounced in Russian), which made it possible to encode letters of the Latin alphabet, Arabic numerals and punctuation marks together with control characters. But the starting point for the development of modern text encodings should be considered the famous ASCII (American Standard Code for Information Interchange, usually pronounced "aski" in Russian). It describes the 128 characters most commonly used by English-speaking users: Latin letters, Arabic numerals and punctuation marks. These 128 characters also include symbols like brackets, slashes, asterisks and so on. Actually, you can see them yourself:
It is these 128 characters from the original version of ASCII that became the standard, and in any other encoding you will definitely find them in the same places. But one byte of information can encode not 128 but 256 different values (two to the power of eight), so after the basic version of ASCII a whole series of extended ASCII encodings appeared, in which, besides the 128 basic characters, it was also possible to encode characters of a national alphabet (for example, Russian).

Here it is probably worth saying a little more about the number systems used in the descriptions. First, as you all know, a computer works only with numbers in the binary system, namely with zeros and ones ("Boolean algebra", if anyone took it at an institute or at school). One byte consists of eight bits, each of which represents a power of two, starting from two to the zero and going up to two to the seventh:
It is not difficult to see that there can be only 256 possible combinations of zeros and ones in such a construction. Converting a number from binary to decimal is quite simple: you just add up all the powers of two that have ones over them. In our example this is 1 (two to the power of zero) plus 8 (two to the power of three), plus 32 (two to the fifth), plus 64 (two to the sixth), plus 128 (two to the seventh), which gives 233 in decimal notation. As you can see, everything is very simple.

But if you take a closer look at a table of ASCII characters, you will see that they are represented in hexadecimal notation. For example, the "asterisk" corresponds in ASCII to the hexadecimal number 2A. You probably know that besides Arabic numerals, the hexadecimal number system also uses the Latin letters from A (meaning ten) to F (meaning fifteen).

To convert a binary number to hexadecimal there is a simple and visual method: each byte of information is divided into two halves of four bits each. In each half-byte only sixteen values (two to the fourth power) can be encoded in binary, and these are easily written as a single hexadecimal digit. Note that in the left half of the byte the powers are counted again starting from zero, not as shown in the screenshot. As a result, by simple calculation we get that the number E9 is encoded in the screenshot. I hope this line of reasoning and the solution of this little puzzle turned out to be clear to you. Well, now let's continue talking about text encodings.
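The same arithmetic can be verified in a couple of lines of Python 3 (a small sketch using the built-in int() and string formatting):

```python
n = int("11101001", 2)   # 128 + 64 + 32 + 8 + 1
print(n)                 # 233 in decimal
print(f"{n:X}")          # E9 in hexadecimal
print(f"{0x2A}")         # 42 - the ASCII "asterisk" 2A in decimal
print(chr(0x2A))         # * - and the character itself
```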

Extended versions of ASCII - the CP866 and KOI8-R encodings with pseudographics

So, we started talking about ASCII, which was the starting point for the development of all modern encodings (Windows-1251, Unicode, UTF-8). Initially it contained only 128 characters: the Latin alphabet, Arabic numerals and some other things. But in the extended versions it became possible to use all 256 values that can be encoded in one byte of information; that is, it became possible to add the letters of your own language to ASCII.

Here it is worth digressing once more to explain why text encodings are needed at all and why this is so important. The characters on your computer screen are formed from two things: a set of vector shapes (representations) of all kinds of characters (they live in the font files installed on your computer) and a code that allows you to pull out of this set of vector shapes (font file) exactly the character that needs to be inserted in the right place. It is clear that the fonts are responsible for the vector shapes themselves, while the operating system and the programs running in it are responsible for the encoding. That is, any text on your computer is a sequence of bytes, each of which encodes one single character of that text.

The program that displays this text on the screen (a text editor, a browser, etc.), when parsing the code, reads the encoding of the next character and looks for the corresponding vector shape in the font file connected to display this text document. Everything is simple and banal. This means that in order to encode any character we need (for example, from a national alphabet), two conditions must be met: the vector shape of this character must exist in the font used, and the character must be encodable in one byte in an extended ASCII encoding.

That is why a whole bunch of such variants exists. For encoding the characters of the Russian language alone there are several varieties of extended ASCII. For example, initially there was CP866, which allowed the use of Russian characters and was an extended version of ASCII. That is, its upper part coincided completely with basic ASCII (128 Latin characters, numbers and so on), shown in the screenshot just above, while the lower part of the CP866 table had the form shown in the screenshot just below and allowed encoding of another 128 characters (Russian letters and all kinds of pseudographics):
You see, in the right column the numbers start with 8, because the numbers from 0 to 7 belong to the basic ASCII part (see the first screenshot). Thus the Russian letter "М" in CP866 has the code 8C (the intersection of row 8 and column C in the hexadecimal table), which fits into one byte of information; given a suitable font with Russian characters, this letter will be displayed in a text without any problems.

Where did such an amount of pseudographics in CP866 come from? The thing is that this encoding for Russian text was developed back in those shaggy years when graphical operating systems were nowhere near as widespread as they are now. In DOS and similar text-mode operating systems, pseudographics made it possible to at least somehow diversify the look of texts, and therefore CP866, like all its peers from the category of extended ASCII versions, abounds in it.
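This is easy to verify with the cp866 codec built into Python 3 (a minimal sketch):

```python
# One byte per character in CP866; the capital М encodes to 0x8C.
print("М".encode("cp866").hex().upper())   # 8C
print("Мир".encode("cp866"))               # b'\x8c\xa8\xe0' - three bytes, three letters
print(b"\x8c".decode("cp866"))             # М - decoding goes the other way
```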
CP866 was distributed by IBM, but besides it a number of other encodings were developed for Russian characters; for example, KOI8-R can be attributed to the same type (extended ASCII):

The principle of its operation remains the same as that of CP866 described a little earlier: each character of the text is encoded with one single byte. The screenshot shows the second half of the KOI8-R table, because its first half fully corresponds to basic ASCII, shown in the first screenshot of this article. Among the features of KOI8-R, note that the Russian letters in its table do not follow alphabetical order, as was done, for example, in CP866. If you look at the very first screenshot (of the basic part, which is included in all extended encodings), you will notice that in KOI8-R the Russian letters sit in the same cells of the table as the Latin letters consonant with them in the first part of the table. This was done for the convenience of switching from Russian to Latin characters by discarding just one bit (two to the seventh power, or 128).
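The bit trick is easy to demonstrate: clearing the eighth bit (the value 128) of a KOI8-R byte leaves the consonant Latin letter from the basic ASCII part (a sketch using the built-in koi8_r codec):

```python
# Drop the high bit of each KOI8-R byte: Cyrillic "мир" degrades to "MIR",
# which is still more or less readable.
print("".join(chr(b & 0x7F) for b in "мир".encode("koi8_r")))   # MIR
```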

Windows-1251 - the modern version of ASCII and why mojibake crawls out

The further development of text encodings was driven by the growing popularity of graphical operating systems, in which the need for pseudographics gradually disappeared. As a result, a whole group of encodings arose that in essence were still extended versions of ASCII (one text character is encoded with just one byte of information), but without pseudographic characters. They belonged to the so-called ANSI encodings, named after the American National Standards Institute. In common parlance the name "Cyrillic" was also used for the variant with Russian-language support. An example is Windows-1251. It compared favorably with the previously used CP866 and KOI8-R in that the place of the pseudographic symbols was taken by the missing symbols of Russian typography (apart from the accent mark), as well as the symbols used in Slavic languages close to Russian (Ukrainian, Belarusian, etc.):
Because of this abundance of Russian-language encodings, font manufacturers and software makers had a constant headache, while we, dear readers, often got the very notorious mojibake whenever the version used in a text was confused. Very often it crawled out when sending and receiving e-mail messages, which led to the creation of very complex conversion tables that, in fact, could not solve the problem at the root; users often resorted to transliteration with Latin letters in their correspondence in order to avoid the notorious garbage when using Russian encodings like CP866, KOI8-R or Windows-1251.

In fact, the garbage that appeared instead of Russian text was the result of applying the wrong encoding for this language, one that did not match the encoding in which the text message was originally encoded. Say, if you try to display characters encoded with CP866 using the Windows-1251 code table, that same mojibake (a meaningless set of characters) comes out and completely replaces the text of the message.

A similar situation very often arises when creating and configuring sites, forums or blogs, when text with Russian characters is mistakenly saved in an encoding other than the one used on the site by default, or in a text editor that adds gibberish to the code that is invisible to the naked eye. In the end, many people got tired of this situation with a multitude of encodings and constantly creeping mojibake, and the prerequisites appeared for creating a new universal variation that would replace all the existing ones and finally solve the problem of unreadable texts. In addition, there was the problem of languages like Chinese, which have far more than 256 characters.
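Incidentally, the CP866-versus-Windows-1251 mishap described above is trivial to reproduce: encode a string with one code page and decode the bytes with another (a Python 3 sketch; the exact garbage depends on the pair of encodings):

```python
data = "Привет".encode("cp866")   # bytes 8F E0 A8 A2 A5 E2
print(data.decode("cp1251"))      # ЏаЁўҐв - mojibake instead of the message
print(data.decode("cp866"))       # Привет - readable again with the right table
```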

Unicode - the universal encodings UTF-8, UTF-16 and UTF-32

Those thousands of characters of the South-East Asian language group could not possibly be described in the single byte of information allocated for encoding characters in the extended versions of ASCII. As a result, a consortium called Unicode (the Unicode Consortium) was created with the cooperation of many IT industry leaders (makers of software, hardware and fonts) who were interested in the emergence of a universal text encoding.

The first variation released under the auspices of the Unicode Consortium was UTF-32. The number in the name of the encoding is the number of bits used to encode one character: 32 bits are 4 bytes of information needed to encode one single character in the new universal encoding UTF. As a result, the same text file encoded in an extended version of ASCII and in UTF-32 will in the latter case have a size (weight) four times larger. This is bad, but now we can encode a number of characters equal to two to the thirty-second power (billions of characters, which covers any really necessary value with an enormous margin).

But many countries with languages of the European group did not need anywhere near that number of characters, and with UTF-32 they would get, for nothing, a fourfold increase in the weight of text documents and, as a consequence, growth in the volume of Internet traffic and stored data. That is a lot, and no one could afford such waste.

As the development of Unicode continued, UTF-16 appeared, and it turned out so successful that it was adopted as the default base space for all the characters we use. It uses two bytes to encode one character. Let's see what this thing looks like. In the Windows operating system you can follow the path "Start" - "Programs" - "Accessories" - "System Tools" - "Character Map". A table with the vector shapes of all the fonts installed on your system will open. If you select the Unicode character set in the "Advanced view" options, you can see, for each font separately, the entire range of characters it includes. By the way, by clicking on any of them you can see its two-byte code in UTF-16 format, consisting of four hexadecimal digits.

How many characters can be encoded in UTF-16 with 16 bits? 65,536 (two to the power of sixteen), and it is this number that was adopted as the base space of Unicode. In addition, there are ways to encode about a million more characters with it (via surrogate pairs), which extends it to cover the full Unicode code space.

But even this successful version of Unicode did not bring much satisfaction to those who, say, wrote programs only in English, because for them, after the transition from the extended version of ASCII to UTF-16, the weight of documents doubled (one byte per character in ASCII versus two bytes for the same character in UTF-16). It was precisely to satisfy everyone and everything that the Unicode Consortium decided to come up with a variable-length encoding. It was called UTF-8. Despite the eight in its name, it really does have a variable length: every character of the text can be encoded as a sequence of one to six bytes. In practice only the range from one to four bytes is used, because nothing beyond four bytes of code is needed for any character defined by the standard. All Latin characters in it are encoded in one byte, just like in the good old ASCII.
Remarkably, if only Latin characters are encoded, even programs that do not understand Unicode will still read what is encoded in UTF-8: the basic part of ASCII simply carried over into this brainchild of the Unicode Consortium. Cyrillic characters in UTF-8 are encoded in two bytes and, for example, Georgian characters in three. Having created UTF-16 and UTF-8, the Unicode Consortium solved the main problem: fonts now have a single code space, and their makers can simply fill it with vector shapes of characters according to their strengths and capabilities. Different fonts support different numbers of characters, as the "Character Map" above shows, and some Unicode-rich fonts can be very large. But now they differ not in having been created for different encodings, but in how fully the font maker has populated the single code space with vector shapes.
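The per-script byte counts mentioned above are easy to check (a sketch; Python 3's str.encode supports all the UTF variants discussed here):

```python
for ch in ("A", "ж", "ა"):         # Latin, Cyrillic, Georgian
    print(ch, len(ch.encode("utf-8")), "byte(s) in UTF-8")   # 1, 2, 3

text = "Hello, world"               # pure ASCII, 12 characters
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, len(text.encode(enc)), "bytes")               # 12, 24, 48
```

The last three lines show exactly the doubling and quadrupling of "weight" described above for English-only text.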

Mojibake instead of Russian letters - how to fix it

Let's now see how mojibake appears instead of text or, in other words, how the correct encoding for Russian text is chosen. Actually, it is set in the program in which you create or edit this very text, or code using text fragments. For editing and creating text files I personally use what is, in my opinion, a very good HTML and PHP editor: Notepad++. It can, by the way, highlight the syntax of a good hundred other programming and markup languages, and it can also be extended with plugins. Read a detailed review of this wonderful program at the link below. In the top menu of Notepad++ there is an "Encodings" item, where you can convert an existing variant to the one used on your site by default:
In the case of a site on Joomla 1.5 and higher, as well as a blog on WordPress, choose the option UTF-8 without BOM to avoid mojibake. What is this BOM prefix? When the UTF-16 encoding was being developed, it was for some reason decided to allow writing a character's code both in direct byte order (for example, 0A15) and in reverse (150A). And in order for programs to understand in which order to read the codes, the BOM (Byte Order Mark or, in other words, signature) was invented: a special marker added to the very beginning of a document (in UTF-8 this signature takes three extra bytes).

UTF-8 has no byte-order ambiguity, so the Unicode Consortium did not provide for a BOM in it, and adding a signature (those notorious three extra bytes at the beginning of the document) simply prevents some programs from reading the code. Therefore, when saving files in UTF, we should always choose the option without BOM (without signature). This way you protect yourself in advance from crawling mojibake.

Remarkably, some programs in Windows cannot do this (cannot save text in UTF-8 without BOM), for example the notorious Windows Notepad. It saves the document in UTF-8 but still adds the signature (three extra bytes) to its beginning. Moreover, these bytes are always the same: read the code in direct order. On a server this little thing can cause a problem: mojibake comes out. So never use the regular Windows Notepad to edit documents of your site if you don't want mojibake to appear. I consider the already mentioned Notepad++ editor the best and simplest option; it has practically no drawbacks and consists only of advantages.

In Notepad++, when you select an encoding, you will also have the option of converting the text to UCS-2, which is inherently very close to the Unicode standard. Also in Notepad++ you can encode text in ANSI, which for the Russian language means Windows-1251, already described a little above. Where does this information come from? It is written in the registry of your Windows operating system: which encoding to choose in the case of ANSI and which in the case of OEM (for the Russian language it is CP866). If you set another default language on your computer, these encodings will be replaced with similar ones from the ANSI or OEM category for that language.

After you save the document in Notepad++ in the encoding you need, or open a document from the site for editing, you can see its name in the lower right corner of the editor. To avoid mojibake, besides the actions described above, it is useful to write the encoding information into the header of the source code of all the pages of the site, so that there is no confusion on the server or on the local host.

In general, all hypertext markup languages except HTML use a special XML declaration that specifies the text encoding:

<?xml version="1.0" encoding="windows-1251"?>

Before parsing the code, the browser knows which version is being used and exactly how the character codes of that language should be interpreted. But remarkably, if you save the document in the default Unicode, this XML declaration can be omitted (the encoding will be considered UTF-8 if there is no BOM, or UTF-16 if there is one).
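In Python 3 the difference between "UTF-8" and "UTF-8 with BOM" corresponds to the codecs utf-8 and utf-8-sig (a minimal sketch):

```python
print("hi".encode("utf-8"))        # b'hi' - no signature
print("hi".encode("utf-8-sig"))    # b'\xef\xbb\xbfhi' - three BOM bytes in front

raw = b"\xef\xbb\xbfhi"            # what Windows Notepad would typically save
print(raw.decode("utf-8-sig"))     # 'hi' - the signature is stripped on decoding
```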
In an HTML document the encoding is specified with the Meta element, written between the opening and closing Head tags:

<head> ... <meta charset="utf-8"> ... </head>

This notation differs quite a bit from the one standardized in HTML 4.01, but it fully conforms to the new HTML 5 standard that is gradually being introduced, and it will be understood 100% correctly by any browser currently in use. In theory, the Meta element with the encoding of the HTML document is best placed as high as possible in the document header, so that by the time the first character outside the basic ANSI range (which is always read correctly in any variation) is encountered in the text, the browser already has the information on how to interpret the codes of those characters.

Unicode is a very large and complex world, because the standard makes it possible to represent and work on a computer with all the major scripts of the world. Some writing systems have existed for more than a thousand years, and many of them developed almost independently of each other in different parts of the world. People have invented so many things, often so different from one another, that combining it all into a single standard was an extremely difficult and ambitious task.

To really understand Unicode, you need at least a superficial idea of the features of all the scripts the standard lets you work with. But is that really necessary for every developer? We would say no. To use Unicode in most everyday tasks, a reasonable minimum of knowledge is enough; you can delve into the standard as the need arises.

In this article, we will talk about the basic principles of Unicode and highlight those important practical issues that developers will certainly face in their daily work.

Why is Unicode needed?

Before the advent of Unicode, single-byte encodings were used almost universally, in which the boundary between the characters themselves, their representation in computer memory and their display on the screen was rather blurred. If you worked with one or another national language, the corresponding encoding fonts were installed on your system, which allowed you to draw the bytes from disk on the screen in such a way that they made sense to the user.

If you printed a text file on a printer and saw a set of incomprehensible garbled characters on the paper, this meant that the appropriate fonts were not loaded into the printer and it was interpreting the bytes differently than you wanted.

This approach in general and single-byte encodings in particular had a number of significant drawbacks:

  1. It was possible to work with only 256 characters at a time, the first 128 of which were reserved for Latin letters and control characters; in the second half, besides the characters of the national alphabet, room had to be found for pseudographic characters (╔ ╗).
  2. The fonts were tied to a specific encoding.
  3. Each encoding represented its own set of characters, and conversion from one to another was possible only with partial losses, when missing characters were replaced with graphically similar ones.
  4. Transferring files between devices running different operating systems was difficult. It was necessary either to have a converter program, or to carry additional fonts along with the file. The existence of the Internet as we know it was impossible.
  5. There are non-alphabetic writing systems in the world (hieroglyphic writing), which are in principle unrepresentable in a single-byte encoding.

Fundamentals of Unicode

We all understand perfectly well that the computer does not know about any ideal entities, but operates with bits and bytes. But computer systems are still created by people, not machines, and it is sometimes more convenient for you and me to operate with speculative concepts, and then move from the abstract to the concrete.

Important! One of the central principles in the philosophy of Unicode is a clear distinction between characters, their representation on a computer, and their display on an output device.

The concept of an abstract Unicode character is introduced, which exists only in the form of a speculative concept and agreement between people, fixed by the standard. Each Unicode character is assigned a non-negative integer called its code point.

So, for example, the Unicode character U+041F is the capital Cyrillic letter П. There are several ways to represent this character in the computer's memory, just as there are several thousand ways to display it on the monitor screen. But at the same time, П will be П (that is, U+041F) anywhere in the world.

This is the well-known encapsulation or separation of the interface from the implementation - a concept that has proven itself in programming.

It turns out that, guided by the standard, any text can be encoded as a sequence of Unicode characters:

Привет ("Hello"): U+041F U+0440 U+0438 U+0432 U+0435 U+0442

write it down on a piece of paper, pack it in an envelope and send it to any part of the Earth. If the recipients know about the existence of Unicode, they will perceive the text exactly the same way as you and I. They will not have the slightest doubt that the penultimate character is precisely the Cyrillic lowercase е (U+0435) and not, say, the Latin small e (U+0065). Notice that we haven't said a word about the byte representation.
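In code this sequence is exactly what ord() produces for each character (a small Python 3 sketch):

```python
print(" ".join(f"U+{ord(ch):04X}" for ch in "Привет"))
# U+041F U+0440 U+0438 U+0432 U+0435 U+0442

# Cyrillic е and Latin e look alike but are different characters:
print("е" == "e")   # False - U+0435 on the left, U+0065 on the right
```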

Unicode code space

The Unicode code space consists of 1,114,112 code points ranging from 0 to 10FFFF. Of these, only 128,237 had been assigned values as of the ninth version of the standard. Part of the space is reserved for private use, and the Unicode Consortium promises never to assign values to positions from these special areas.

For convenience, the whole space is divided into 17 planes (six of them are currently in use). Until recently it was customary to say that most likely you would only ever deal with the Basic Multilingual Plane (BMP), which includes the Unicode characters from U+0000 to U+FFFF. (Looking ahead a little: characters from the BMP are represented in UTF-16 by two bytes, not four.) In 2016 this thesis is already in doubt: popular Emoji characters, for example, may well turn up in a user message, and you need to be able to process them correctly.
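A quick way to see that such characters live outside the BMP (a sketch):

```python
ch = "😀"                             # U+1F600, an Emoji from plane 1
print(hex(ord(ch)))                   # 0x1f600 - greater than 0xFFFF
print(len(ch.encode("utf-16-le")))    # 4 bytes: a surrogate pair, not two bytes
```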

Encodings

If we want to send text over the Internet, then we need to encode a sequence of Unicode characters as a sequence of bytes.

The Unicode Standard includes a description of a number of Unicode encodings, such as UTF-8 and UTF-16BE/UTF-16LE, which allow the entire space of code points to be encoded. Conversion between these encodings can be freely carried out without loss of information.

Also, no one has canceled single-byte encodings: each of them covers its own individual and very narrow piece of the Unicode spectrum, 256 or fewer code positions. For such encodings there exist tables, available to everyone, that map each single-byte value to a Unicode character (see, for example, CP1251.TXT). Despite the limitations, single-byte encodings turn out to be very practical when working with a large amount of monolingual text.

Of the Unicode encodings, UTF-8 is the most common on the Internet (it took the lead in 2008), mainly thanks to its economy and its transparent compatibility with seven-bit ASCII. Latin letters and service characters, basic punctuation marks and digits - i.e. all seven-bit ASCII characters - are encoded in UTF-8 with one byte, the same as in ASCII. The characters of many of the main scripts, apart from some rarer hieroglyphic characters, are represented in it by two or three bytes. The largest code position defined by the standard, 10FFFF, is encoded with four bytes.

Note that UTF-8 is a variable-length encoding. Each Unicode character in it is represented by a sequence of code units, with a minimum length of one unit. The number 8 is the bit length of the code unit: 8 bits. For the UTF-16 family the code unit size is, accordingly, 16 bits; for UTF-32, 32 bits.

If you send an HTML page with Cyrillic text over the network, UTF-8 can give a very tangible gain, because all the markup, as well as the JavaScript and CSS blocks, is effectively encoded with one byte per character. For example, the main page of Habr in UTF-8 takes 139 KB, while in UTF-16 it is already 256 KB. For comparison, using win-1251 (losing the ability to save some characters) would reduce the size by only 11 KB relative to UTF-8.

Applications often use 16-bit Unicode encodings to store string information because of their simplicity and because the characters of the world's major writing systems are encoded in a single sixteen-bit unit. Java, for example, successfully uses UTF-16 for the internal representation of strings, and the Windows operating system also uses UTF-16 internally.

In any case, as long as we stay in Unicode space, it doesn't really matter how string information is stored within a single application. If the internal storage format allows you to correctly encode all more than a million code positions and there is no loss of information at the application boundary, for example, when reading from a file or copying to the clipboard, then everything is fine.

To correctly interpret text read from disk or from a network socket, you must first determine its encoding. This is done either from user-provided meta-information written in or next to the text, or heuristically.
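The standard library has no encoding detector; one common heuristic option is the third-party chardet package (a hedged sketch - treat the package and its exact output as an assumption to verify in your environment):

```python
# pip install chardet   (third-party package)
import chardet

raw = "Привет, мир".encode("cp1251")
print(chardet.detect(raw))
# e.g. {'encoding': 'windows-1251', 'confidence': 0.9, 'language': 'Russian'}
```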

The bottom line

That was a lot of information, so it makes sense to give a brief summary of everything written above:

  • Unicode postulates a clear distinction between characters, their representation on a computer, and their display on an output device.
  • The Unicode code space consists of 1,114,112 code points ranging from 0 to 10FFFF.
  • The Basic Multilingual Plane includes the Unicode characters U+0000 through U+FFFF, which are encoded in UTF-16 as two bytes.
  • Any Unicode encoding allows you to encode the entire space of Unicode code positions, and conversion between various such encodings is carried out without loss of information.
  • Single-byte encodings encode only a small part of the Unicode spectrum, but can be useful when working with a large amount of monolingual information.
  • UTF-8 and UTF-16 encodings have variable code length. In UTF-8, each Unicode character can be encoded as one, two, three, or four bytes. In UTF-16, two or four bytes.
  • The internal format for storing textual information within a particular application can be arbitrary, provided that it works correctly with the entire space of Unicode code positions and that there are no losses when data crosses the application boundary.

A quick note about the term "encoding"

There can be some confusion with the term "encoding", because within Unicode encoding happens twice. The first time, the Unicode character set is encoded, in the sense that each Unicode character is assigned a code position; in this process the Unicode character set is turned into a coded character set. The second time, a sequence of Unicode characters is converted into a string of bytes, and this process is also called encoding.
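The two senses of "encoding" correspond to two distinct steps in code (a Python 3 sketch):

```python
ch = "П"
print(f"U+{ord(ch):04X}")    # step 1: character -> code position (U+041F)

data = ch.encode("utf-8")    # step 2: code position -> bytes
print(data)                  # b'\xd0\x9f'
```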

In English terminology there are two different verbs, to code and to encode, but even native speakers often confuse them. In addition, the term character set (or charset) is used as a synonym for the term coded character set.

We say all this to make the point that it is worth paying attention to context and distinguishing situations where we are talking about the code position of an abstract Unicode character from those where we are talking about its byte representation.

Finally

There are so many different aspects of Unicode that it is impossible to cover everything in one article - nor is it necessary. The information above is quite enough to avoid confusion about the basic principles and to work with text in most everyday tasks (read: without going beyond the BMP). In the following articles we will talk about normalization, give a fuller historical overview of the development of encodings, discuss the problems of Russian-language Unicode terminology, and also prepare material on the practical aspects of using UTF-8 and UTF-16.

Unicode: UTF-8, UTF-16, UTF-32.

Unicode is a set of graphic characters and a way to encode them for computer processing of text data.

Unicode not only assigns a unique code to each character, but also defines various characteristics of that character, for example:

    character type (uppercase letter, lowercase letter, number, punctuation mark, etc.);

    character attributes (left-to-right or right-to-left display, space, line break, etc.);

    corresponding capital or small letter (for lowercase and uppercase letters respectively);

    the corresponding numeric value (for numeric characters).

The UTF standards (UTF is an abbreviation for Unicode Transformation Format) represent characters as follows:

UTF-16: Windows uses the UTF-16 encoding to represent all Unicode characters. In UTF-16, characters are represented by two bytes (16 bits). This encoding is used in Windows because 16-bit values can represent the characters that make up the alphabets of most of the world's languages, which allows programs to process strings faster and compute their length more easily. However, 16 bits are not enough to represent the characters of some alphabets. For such cases UTF-16 supports "surrogate" pairs, allowing characters to be encoded in 32 bits (4 bytes). However, few applications have to deal with the characters of such languages, so UTF-16 is a good compromise between saving memory and ease of programming. Note that in the .NET Framework all characters are encoded using UTF-16, so using UTF-16 in Windows applications improves performance and reduces memory consumption when passing strings between native and managed code.
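The "surrogate" mechanism mentioned above is only a few lines of arithmetic (a Python 3 sketch computing the pair for one supplementary character and comparing it with the real UTF-16 bytes):

```python
cp = ord("😀") - 0x10000                # 0x0F600 - offset into the extra planes
high = 0xD800 + (cp >> 10)             # 0xD83D - high (lead) surrogate
low = 0xDC00 + (cp & 0x3FF)            # 0xDE00 - low (trail) surrogate
print(hex(high), hex(low))
print("😀".encode("utf-16-be").hex())   # d83dde00 - the same pair on the wire
```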

UTF-8: In UTF-8 encoding, different characters can be represented by 1, 2, 3 or 4 bytes. Characters with values less than 0x0080 are compressed to 1 byte, which is very convenient for US characters. Characters with values in the range 0x0080-0x07FF are converted into 2-byte values, which works well for European and Middle Eastern alphabets. Characters with larger values are converted into 3-byte values, convenient when working with East Asian languages. Finally, "surrogate" pairs are written in 4-byte format. UTF-8 is an extremely popular encoding; however, it is less efficient than UTF-16 if characters with values of 0x0800 and above are used frequently.

UTF-32: In UTF-32, all characters are represented by 4 bytes. This encoding is convenient for writing simple algorithms that enumerate the characters of any language without having to deal with characters represented by different numbers of bytes. For example, with UTF-32 you can forget about "surrogates", since any character in this encoding is represented by 4 bytes. Clearly, UTF-32 is far from ideal in terms of memory usage, which is why it is rarely used for transferring strings over the network or saving them to files. As a rule, UTF-32 is used as an internal format for representing data in a program.
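Fixed width makes indexing trivial: character number n always starts at byte 4*n (a sketch):

```python
data = "aб😀".encode("utf-32-le")   # every character occupies exactly 4 bytes
print(len(data))                    # 12 bytes for 3 characters

# Pull out character 2 without scanning from the start:
print(data[2 * 4:3 * 4].decode("utf-32-le"))   # 😀
```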

UTF-8

In the near future, a special Unicode (and ISO 10646) format called UTF-8 may take a special place. This "derivative" encoding uses strings of bytes of various lengths (from one to six) to write characters, which are converted into Unicode codes by a simple algorithm, with shorter strings corresponding to more common characters. The main advantage of this format is compatibility with ASCII not only in the code values but also in the number of bits per character, since one byte is enough to encode any of the first 128 characters in UTF-8 (although, for example, Cyrillic letters need two bytes).

The UTF-8 format was invented on September 2, 1992 by Ken Thompson and Rob Pike and implemented in Plan 9. The UTF-8 standard is now officially fixed in RFC 3629 and in Annex D of ISO/IEC 10646.

For the web designer this encoding is of particular importance, since it has been declared the "standard document encoding" in HTML starting with version 4.

Text consisting only of characters with numbers less than 128 turns into plain ASCII text when written in UTF-8. Conversely, in UTF-8 text any byte with a value less than 128 represents the ASCII character with the same code. The remaining Unicode characters are represented by sequences of 2 to 6 bytes in length (in reality only up to 4 bytes, since codes greater than 2^21 are not planned), in which the first byte always has the form 11xxxxxx and the rest 10xxxxxx.
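These rules fit in a few lines; below is a sketch of a hand-written encoder for code points up to U+10FFFF (the built-in codec does this for you, of course):

```python
def utf8_bytes(cp: int) -> bytes:
    """Encode one code point into UTF-8 following the bit patterns above."""
    if cp < 0x80:                    # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                   # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                 # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,   # 11110xxx and three
                  0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])  # continuation bytes

for ch in "aП€😀":                   # 1-, 2-, 3- and 4-byte examples
    assert utf8_bytes(ord(ch)) == ch.encode("utf-8")
print("matches the built-in codec")
```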

Simply put, in UTF-8 the Latin characters, punctuation and ASCII control characters are written with US-ASCII codes, and all other characters are encoded with several octets whose most significant bit is 1. This has two effects.

    Even if a program does not recognize Unicode, Latin letters, Arabic numerals and punctuation marks will still be displayed correctly.

    If Latin letters and the simplest punctuation marks (including the space) make up a significant share of the text, UTF-8 gives a gain in volume compared to UTF-16.

    At first glance it may seem that UTF-16 is more convenient, since most characters in it are encoded in exactly two bytes. However, this is negated by the need to support surrogate pairs, which are often forgotten about when using UTF-16, with only UCS-2 character support actually implemented.

The standard was proposed in 1991 by the non-profit organization Unicode Consortium (Unicode Inc.). The use of this standard makes it possible to encode a very large number of characters from different scripts: Chinese characters, mathematical symbols, letters of the Greek alphabet and the Latin and Cyrillic alphabets can coexist in Unicode documents, and switching code pages becomes unnecessary.

The standard consists of two main sections: the universal character set (UCS, universal character set) and the encoding family (UTF, Unicode transformation format). The universal character set specifies a one-to-one correspondence of characters to codes - elements of the code space representing non-negative integers. An encoding family defines the machine representation of a sequence of UCS codes.

The Unicode standard was developed with the goal of creating a single character encoding for all modern and many ancient written languages. In the original version of the standard, each character was encoded with 16 bits, which allowed it to cover an incomparably larger number of characters than the previously used 8-bit encodings. Another important difference between Unicode and other encoding systems is that it not only assigns a unique code to each character, but also defines various characteristics of that character, for example:

Character type (uppercase letter, lowercase letter, number, punctuation mark, etc.);

Character attributes (display from left to right or right to left, space, line break, etc.);

The corresponding uppercase or lowercase letter (for lowercase and uppercase letters, respectively);

The corresponding numeric value (for numeric characters).

The entire range of codes from 0 to FFFF is divided into several standard subsets, each of which corresponds either to the alphabet of some language, or to a group of special characters that are similar in their functions. The diagram below contains a general list of subsets of Unicode 3.0 (Figure 2).

Figure 2. Subsets of Unicode 3.0

The Unicode standard is the basis for storing and representing text in many modern computer systems. However, it is not compatible with most Internet protocols, since its codes can contain any byte values, while protocols usually use the bytes 00 - 1F and FE - FF as service bytes. To achieve compatibility, several Unicode transformation formats (UTF, Unicode Transformation Formats) were developed, of which UTF-8 is the most common today. This format defines the following rules for converting each Unicode code into a set of bytes (from one to three) suitable for transport by Internet protocols:


Unicode code range    UTF-8 byte sequence
0000 - 007F           0zzzzzzz
0080 - 07FF           110yyyyy 10zzzzzz
0800 - FFFF           1110xxxx 10yyyyyy 10zzzzzz

Here x, y, z denote the bits of the source code, which are extracted starting from the least significant bit and entered into the result bytes from right to left until all the indicated positions are filled.

The further development of the Unicode standard is associated with the addition of new language planes, i.e. characters in the ranges 10000 - 1FFFF, 20000 - 2FFFF, etc., where it is planned to include encodings for the scripts of dead languages that are not included in the table above. A new format, UTF-16, was developed to encode these additional characters.

Thus, there are 4 main ways of encoding Unicode characters into bytes:

UTF-8: 128 characters are encoded in one byte (ASCII format), 1,920 characters in 2 bytes (Roman, Greek, Cyrillic, Coptic, Armenian, Hebrew and Arabic characters), and 63,488 characters in 3 bytes (Chinese, Japanese, etc.). The remaining 2,147,418,112 characters (not yet used) can be encoded in 4, 5 or 6 bytes.

UCS-2: Each character is represented by 2 bytes. This encoding includes only the first 65,536 characters of the Unicode format.

UTF-16: This is an extension of UCS-2 and covers 1,114,112 Unicode characters. The first 65,536 characters are represented by 2 bytes, the rest by 4 bytes.

UCS-4: Each character is encoded with 4 bytes.
