The international Unicode standard. Why was Unicode needed? A quick note on encodings

Unicode is an international character encoding standard that allows text to be displayed consistently on any computer in the world, regardless of its system language.

The basics

To understand what the Unicode character table is for, let's first recall how text ends up on a monitor screen. The computer, as we know, processes all information in digital form, while a person needs it presented graphically to perceive it correctly. So, for us to be able to read this text, at least two tasks have to be solved: each character must be represented by a numeric code, and each code must be mapped to a graphic image (glyph) that can be drawn on the screen.

First encodings

The American ASCII is considered the ancestor of all encodings. It described the Latin alphabet used in English, together with punctuation marks and Arabic numerals. The 128 characters it defined became the basis for all subsequent developments - even the modern Unicode character table includes them. Since then, the letters of the Latin alphabet have occupied the first positions in every encoding.

Eight-bit extensions of ASCII allowed 256 characters to be stored in total, but since the first 128 were occupied by the Latin alphabet, the remaining 128 began to be used all over the world to create national standards. For example, in Russia, CP866 and KOI8-R were created on this basis. Such variants became known as extended versions of ASCII.

Code pages and "krakozyabry"

The further development of technology and the emergence of graphical interfaces led to the creation of the ANSI encoding by the American National Standards Institute. Russian users, especially those with some experience, know its variant as Windows-1251. It also introduced the concept of a "code page" for the first time. It was with the help of code pages, which contained the characters of national alphabets in addition to Latin, that "mutual understanding" was established between computers used in different countries.

However, the existence of many different encodings for the same language began to cause problems. This is how the so-called krakozyabry (mojibake) appeared: garbage characters that result from a mismatch between the code page in which the information was originally created and the code page used by default on the end user's computer.

Take, as an example, the Cyrillic encodings CP866 and KOI8-R mentioned above. The same letters occupied different code positions and followed different placement principles: in the first they were arranged in alphabetical order, in the second in a seemingly arbitrary one. You can imagine what a user saw when trying to open such a text without the required code page, or when the computer misinterpreted it.

Creation of Unicode

The proliferation of the Internet and related technologies, such as e-mail, eventually meant that the situation with garbled text no longer suited anyone. Leading IT companies formed the Unicode Consortium. The character table it introduced in 1991, under the name UTF-32, could store over a billion unique characters. It was a crucial step on the way to untangling text encodings.

However, the first universal Unicode character table, UTF-32, was not widely adopted. The main reason was redundancy: it was quickly calculated that, for countries using the Latin alphabet, text encoded with the new universal table would take up four times as much space as with the extended ASCII table.

Development of Unicode

The next Unicode character table, UTF-16, fixed this problem. Encoding used half as many bits, but the number of possible combinations also decreased: instead of billions of characters it stores only 65,536. Nevertheless, it was so successful that the Consortium defined this number as the basic storage space for Unicode characters.

Despite this success, UTF-16 did not suit everyone, since the amount of stored and transmitted information was still doubled for Latin-based text. The universal solution turned out to be UTF-8, a variable-length Unicode encoding, whose appearance can rightly be called a breakthrough in this area.

Thus, with the introduction of the last two standards, the Unicode character table has solved the problem of a single code space for all fonts in use today.

Unicode for Russian

Thanks to the variable length of the codes used, the Latin alphabet is encoded in Unicode exactly as in its ancestor ASCII, that is, in one byte. For other alphabets the picture is different: characters of the Georgian alphabet, for example, use three bytes each, and characters of the Cyrillic alphabet use two. All this is possible within the UTF-8 Unicode encoding (character table). The Russian language, or rather the Cyrillic script, occupies 448 positions in the overall code space, divided into five blocks.
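
To make these byte counts concrete, here is a small illustrative Python 3 snippet (the specific letters are arbitrary examples chosen for this article):

>>> len('a'.encode('utf-8'))    # Latin letter: 1 byte in UTF-8
1
>>> len('ф'.encode('utf-8'))    # Cyrillic letter: 2 bytes
2
>>> len('ა'.encode('utf-8'))    # Georgian letter: 3 bytes
3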

These five blocks include the basic Cyrillic and Church Slavonic alphabets, as well as additional letters of other languages that use Cyrillic. A number of positions are set aside for old forms of Cyrillic letters, and 22 positions of the total are still free.

Current version of Unicode

Having solved its primary task - standardizing fonts and creating a single code space for them - the Consortium did not stop its work. Unicode is constantly evolving and expanding. The latest version of the standard at the time of writing, 9.0, was released in 2016; it added six more scripts and expanded the list of standardized emoji.

It is worth noting that, to simplify research, even so-called dead languages are added to Unicode. They are called that because no one speaks them natively any more; the group also includes languages that have survived only in the form of written records.

In principle, anyone can apply to have characters added to a new Unicode specification, although this requires filling out a fair amount of paperwork and spending a lot of time. A living example is the story of the programmer Terence Eden: in 2013 he filed an application to include symbols for computer power-control buttons in the specification. They had appeared in technical documentation since the mid-1970s, but did not become part of Unicode until the 9.0 specification.

Character table

Every computer, regardless of the operating system used, uses a Unicode character table. How to use these tables, where to find them, and why can they be useful to an ordinary user?

In Windows, the character map is located in the "System Tools" section of the menu. In the Linux family of operating systems it can usually be found under the "Standard" subsection, and on macOS under the keyboard preferences. The main purpose of this table is to enter characters that are not on the keyboard into text documents.

Such tables find the widest range of uses: from entering technical symbols and national currency signs to writing guides on the practical use of tarot cards.

Finally

Unicode is used everywhere and has entered our lives along with the development of the Internet and mobile technologies. Thanks to it, international communication has become significantly simpler. One could say that the introduction of Unicode is a telling, yet outwardly almost invisible, example of technology used for the common good of all mankind.

Unicode: UTF-8, UTF-16, UTF-32.

Unicode is a set of graphic characters and a way of encoding them for computer processing of text data.

Unicode not only assigns a unique code to each character, but also defines various characteristics of that character, for example:

    character type (uppercase letter, lowercase letter, number, punctuation mark, etc.);

    character attributes (left-to-right or right-to-left display, space, line break, etc.);

    the corresponding uppercase or lowercase letter (for lowercase and uppercase letters respectively);

    the corresponding numeric value (for numeric characters).

UTF standards (short for Unicode Transformation Format) for representing characters:

UTF-16: In Windows, all Unicode characters are represented using the UTF-16 encoding. In UTF-16, characters are represented by two bytes (16 bits). This encoding is used in Windows because 16-bit values can represent the characters that make up the alphabets of most of the world's languages, which allows programs to process strings and compute their length faster. However, 16 bits are not enough to represent the characters of some languages. For such cases UTF-16 supports "surrogate" encodings, which allow a character to be encoded in 32 bits (4 bytes). However, few applications have to deal with the characters of such languages, so UTF-16 is a good compromise between saving memory and ease of programming. Note that in the .NET Framework all characters are encoded using UTF-16, so using UTF-16 in Windows applications improves performance and reduces memory consumption when passing strings between native and managed code.
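
A rough illustration of the surrogate mechanism described above (a Python 3 sketch; the characters are arbitrary examples, and the -le variant is used so that no byte-order mark is added):

>>> len('Ω'.encode('utf-16-le'))    # a BMP character: one 16-bit unit
2
>>> len('𝄞'.encode('utf-16-le'))    # U+1D11E lies outside the BMP: a surrogate pair, 4 bytes
4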

UTF-8: In the UTF-8 encoding, different characters can be represented by 1, 2, 3, or 4 bytes. Characters with values less than 0x0080 are compressed to 1 byte, which is very convenient for US characters. Characters with values in the range 0x0080-0x07FF are converted to 2-byte values, which works well for European and Middle Eastern alphabets. Characters with larger values are converted to 3-byte values, useful for working with Central Asian languages. Finally, surrogate pairs are written in a 4-byte format. UTF-8 is an extremely popular encoding, but it is less efficient than UTF-16 if characters with values of 0x0800 and higher are used frequently.
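
The boundaries listed above can be checked directly; a minimal Python 3 sketch:

>>> len('\u007f'.encode('utf-8'))      # below 0x0080: 1 byte
1
>>> len('\u0080'.encode('utf-8'))      # 0x0080..0x07FF: 2 bytes
2
>>> len('\u0800'.encode('utf-8'))      # 0x0800 and above (within the BMP): 3 bytes
3
>>> len('\U0001F600'.encode('utf-8'))  # a character written as a surrogate pair in UTF-16: 4 bytes
4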

UTF-32: In UTF-32, all characters are represented by 4 bytes. This encoding makes it easy to write simple algorithms that iterate over the characters of any language without having to handle characters represented by different numbers of bytes. For example, with UTF-32 you can forget about "surrogates", since every character in this encoding is represented by 4 bytes. Clearly, from a memory-usage standpoint UTF-32's efficiency is far from ideal, so this encoding is rarely used to transfer strings over a network or save them to files. Typically, UTF-32 is used as an internal format for representing data within a program.

UTF-8

In the near future, an increasingly important role will be played by a special Unicode (and ISO 10646) format called UTF-8. This "derived" encoding uses byte sequences of various lengths (from one to six) to record characters, converted to Unicode codes by a simple algorithm, with shorter sequences corresponding to more common characters. The main advantage of this format is compatibility with ASCII not only in code values but also in the number of bits per character, since one byte is enough to encode any of the first 128 characters in UTF-8 (though, for example, Cyrillic letters take two bytes).

The UTF-8 format was invented on September 2, 1992 by Ken Thompson and Rob Pike and implemented in Plan 9. The UTF-8 standard is now formalized in RFC 3629 and ISO/IEC 10646 Annex D.

For a web designer this encoding is of particular importance, since it has been declared the "standard document encoding" in HTML since version 4.

Text containing only characters numbered below 128 turns into plain ASCII text when written in UTF-8. Conversely, in UTF-8 text any byte with a value less than 128 represents the ASCII character with the same code. The remaining Unicode characters are represented by sequences of 2 to 6 bytes (in practice only up to 4 bytes, since codes above 2^21 are not planned to be used), in which the first byte always has the form 11xxxxxx and the rest 10xxxxxx.
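
For example, the two UTF-8 bytes of a Cyrillic letter show exactly these bit patterns - the first byte starts with 110 and the continuation byte with 10 (a Python 3 sketch; the letter is an arbitrary choice):

>>> [bin(b) for b in 'ж'.encode('utf-8')]
['0b11010000', '0b10110110']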

Simply put, in the UTF-8 format, Latin characters, punctuation marks, and ASCII control characters are written with their US-ASCII codes, and all other characters are encoded as several octets with the most significant bit set to 1. This has two effects.

    Even if a program does not recognize Unicode, Latin letters, Arabic numerals and punctuation marks will still be displayed correctly.

    If Latin letters and simple punctuation marks (including the space) make up a significant share of the text, UTF-8 gives a gain in size compared to UTF-16.

    At first glance it might seem that UTF-16 is more convenient, since most characters are encoded in exactly two bytes. However, this is negated by the need to support surrogate pairs, which are often forgotten about when using UTF-16, with only UCS-2 support actually implemented.

The standard was proposed in 1991 by Unicode Inc. (the Unicode Consortium), a non-profit organization. The use of this standard makes it possible to encode a very large number of characters from different scripts: Chinese characters, mathematical symbols, letters of the Greek alphabet, Latin and Cyrillic can coexist in Unicode documents, so switching between code pages becomes unnecessary.

The standard consists of two main sections: the universal character set (UCS) and the Unicode transformation format (UTF). The universal character set defines a one-to-one correspondence of characters to codes - elements of the code space that represent non-negative integers. The family of encodings defines the machine representation of a sequence of UCS codes.

The Unicode standard was developed with the goal of creating a uniform character encoding for all modern, and many ancient, written languages. Each character in this standard is encoded in 16 bits, which allows it to cover an incomparably larger number of characters than the previously accepted 8-bit encodings. Another important difference between Unicode and other encoding systems is that it not only assigns a unique code to each character but also defines various characteristics of that character, for example:

Character type (uppercase letter, lowercase letter, number, punctuation mark, etc.);

Character attributes (left-to-right or right-to-left display, space, line break, etc.);

The corresponding uppercase or lowercase letter (for lowercase and uppercase letters, respectively);

The corresponding numeric value (for numeric characters).

The entire range of codes from 0 to FFFF is divided into several standard subsets, each of which corresponds either to the alphabet of some language or to a group of special characters similar in function. The diagram below gives a general list of the Unicode 3.0 subsets (Figure 2).

Figure 2. General list of Unicode 3.0 subsets.

The Unicode standard is the basis for storing and representing text in many modern computer systems. However, it is not directly compatible with most Internet protocols, since its codes may contain any byte values, while the protocols usually reserve the byte values 00-1F and FE-FF for service purposes. To achieve interoperability, several Unicode transformation formats (UTFs) have been developed, of which UTF-8 is the most common today. This format defines the following rules for converting each Unicode code into a sequence of one to three bytes:

0x0000-0x007F:  0xxxxxxx
0x0080-0x07FF:  110yyyyy 10xxxxxx
0x0800-0xFFFF:  1110zzzz 10yyyyyy 10xxxxxx

Here x, y, z denote the bits of the source code, extracted starting from the least significant bit and entered into the result bytes from right to left until all the indicated positions are filled.
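
As a rough illustration of these rules, here is a small Python 3 sketch that assembles the two-byte form by hand and compares it with the built-in encoder (the helper name utf8_two_bytes is made up for this example):

def utf8_two_bytes(cp):
    # for code points in the range 0x0080..0x07FF: 110yyyyy 10xxxxxx
    return bytes([0b11000000 | (cp >> 6), 0b10000000 | (cp & 0b00111111)])

cp = ord('я')                                      # U+044F
print(utf8_two_bytes(cp))                          # b'\xd1\x8f'
print(utf8_two_bytes(cp) == 'я'.encode('utf-8'))   # True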

Further development of the Unicode standard is associated with the addition of new language planes, i.e. characters in the ranges 10000-1FFFF, 20000-2FFFF, and so on, where the scripts of dead languages not included in the table above are to be encoded. For encoding these additional characters, the UTF-16 format is used.

Thus, there are four main ways of encoding Unicode as bytes:

UTF-8: 128 characters are encoded in one byte (the ASCII range), 1,920 characters are encoded in 2 bytes (Roman, Greek, Cyrillic, Coptic, Armenian, Hebrew and Arabic characters), 63,488 characters are encoded in 3 bytes (Chinese, Japanese and others). The remaining 2,147,418,112 characters (not yet used) can be encoded with 4, 5 or 6 bytes.

UCS-2: Each character is represented by 2 bytes. This encoding includes only the first 65,535 characters from the Unicode format.

UTF-16: This is an extension of UCS-2 and includes 1,114,112 Unicode characters. The first 65,535 characters are represented by 2 bytes, the rest by 4 bytes.

UCS-4: Each character is encoded in 4 bytes.
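
A small Python 3 sketch comparing how much space the same short mixed Latin/Cyrillic string takes in three of these encodings (the string itself is an arbitrary example; the -le variants are used so that no byte-order mark is added):

>>> s = 'Hi, мир'
>>> [(enc, len(s.encode(enc))) for enc in ('utf-8', 'utf-16-le', 'utf-32-le')]
[('utf-8', 10), ('utf-16-le', 14), ('utf-32-le', 28)]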

Believe it or not, there is an image format built into the browser. It allows images to be loaded before they are needed, renders them on regular or retina screens, and lets you style them with CSS. OK, that's not entirely true: it is not really an image format, although everything else still applies. Using it, you can create resolution-independent icons that take no time to load and are styled with CSS.

What is Unicode?

Unicode is the ability to correctly display letters and punctuation marks from different languages on a single page. It is incredibly useful: users all over the world will be able to interact with your site, and it will show what you intended, whether that is French with accents or kanji.

Unicode continues to evolve: now version 8.0 is relevant, with more than 120 thousand characters (in the original article, published in early 2014, it was about version 6.3 and 110 thousand characters).

Besides letters and numbers, Unicode contains other symbols and icons. In the latest versions these include emoji, which you can see in the iOS messenger.

HTML pages are created from a sequence of Unicode characters and are converted to bytes when sent over the network. Each letter and each character of any language has its own unique code and is encoded when the file is saved.

When using the UTF-8 encoding you can insert Unicode characters directly into the text, but you can also add them by specifying a numeric character reference. For example, &#9829; is a reference for the heart symbol (♥), and you can display that symbol simply by adding this code to the markup.

This numeric reference can be specified in either decimal or hexadecimal format. The hexadecimal format requires adding an x after the #: the notation &#x2665; gives the same heart (♥) as the previous variant (2665 is the hexadecimal form of 9829).

If you are adding a Unicode character using CSS, then you can only use hex values.

Some of the most commonly used Unicode characters have more memorable text names or abbreviations instead of numeric codes, such as the ampersand (&amp; for &). Such symbols are called mnemonics (named character references) in HTML; a complete list is available on Wikipedia.
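
For illustration, the decimal reference, the hexadecimal reference and the named mnemonic all resolve to the same character; a quick check with Python 3's standard html module (the heart character is the same example as above):

>>> import html
>>> ord('♥'), hex(ord('♥'))
(9829, '0x2665')
>>> html.unescape('&#9829;'), html.unescape('&#x2665;'), html.unescape('&hearts;')
('♥', '♥', '♥')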

Why should you use Unicode?

Good question, here are some reasons:

  1. To use correct characters from different languages.
  2. To replace icons.
  3. To replace icons connected via @font-face.
  4. To define CSS classes.

Valid characters

The first of the reasons does not require any additional actions... If the HTML is saved in UTF-8 format and its encoding is transmitted over the network as UTF-8, everything should work as it should.

Should, that is. Unfortunately, not all browsers and devices support all Unicode characters equally well (more precisely, not all fonts include the full set of characters). For example, recently added emoji characters are not supported everywhere.

For UTF-8 support in HTML5, add <meta charset="utf-8"> to the page head (if you do not have access to the server settings, adding it in the markup is especially important). With the old doctype, the longer form <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> is used.

Icons

The second reason for using Unicode is that there are many useful symbols that can be used as icons, for example ≡ and many others.

Their obvious advantage is that you don't need any additional files to add them to the page, which makes your site faster. You can also change their color or add a drop shadow with CSS, and by adding CSS transitions you can smoothly change the icon's color on hover without any extra images.

Let's say I want to include a star rating indicator on my page. I can do it like this:

★ ★ ★ ☆ ☆

You get the following result:

But if you're unlucky, you'll see something like this:

Same rating on BlackBerry 9000

This happens if the characters used are not in the browser or device font (fortunately, these asterisks are perfectly supported and old BlackBerry phones are the only exception here).

If there is no Unicode character, in its place there can be different characters from an empty square (□) to a diamond with a question mark (�).

How do you find a Unicode character that might suit your design? You can search a site like Unicodinator and browse the available characters, but there is also a better way: Shape Catcher, a great site that lets you draw the icon you are looking for and then offers a list of similar Unicode characters.

Using Unicode with @font-face icons

If you are using icons connected via an external font with @font-face, Unicode characters can be used as a fallback. This way you can display a similar Unicode character on devices or browsers where @font-face is not supported:

On the left are the Font Awesome icons in Chrome, and on the right are their Unicode replacements in Opera Mini.

Many @font-face mapping tools use a range of Unicode characters from the Private Use Area. The problem with this approach is that if @font-face is not supported, meaningless character codes are shown to the user.

The icon-set generator mentioned in the link list below is great for creating @font-face icon sets and lets you choose a suitable Unicode character as the basis for each icon.

But be careful - some browsers and devices don't like individual Unicode characters when they are used with @font-face. It makes sense to test Unicode character support with Unify - this app will help you determine how safe it is to use a character in a @font-face icon set.

Unicode character support

The main problem with using Unicode characters as a fallback is poor support in screen readers (again, some information on this can be found on Unify), so it is important to choose the characters you use carefully.

If your icon is just a decorative element next to a text label that a screen reader can read, you don't need to worry too much. But if the icon stands on its own, it is worth adding a hidden text label to help screen reader users. Even when a Unicode character is read by a screen reader, there is a chance it will sound very different from its intended meaning. For example, ≡ used as a hamburger icon will be read as "identical" by VoiceOver on iOS.

Unicode in CSS class names

The fact that Unicode can be used in class names and in style sheets has been known since 2007, when Jonathan Snook wrote about using Unicode characters in helper classes for rounded corners. The idea has not spread widely, but it is worth knowing that Unicode can be used in class names (special characters or Cyrillic).

Font selection

Few fonts support the full set of Unicode characters, so be sure to check for the characters you want when choosing a font.

There are lots of icons in Segoe UI Symbol and Arial Unicode MS. These fonts are available on both PC and Mac; Lucida Grande also has a fair number of Unicode characters. You can add these fonts to your font-family declaration to make the maximum number of Unicode characters available to users who have them installed.

Determining Unicode Support

It would be great to be able to check for the presence of a particular Unicode character, but there is no guaranteed way to do this.

Unicode characters can be effective where they are supported. For example, an emoji in an email subject line makes it stand out among the rest in the mailbox.

Conclusion

This article only covers the basics of Unicode. I hope you find it helpful in helping you understand Unicode better and use it effectively.

List of links

  • (Unicode-based @font-face icon set generator)
  • Shape Catcher (Unicode character recognition tool)
  • Unicodinator (Unicode character table)
  • Unify (Check for Unicode character support in browsers)
  • Unitools (A collection of tools for working with Unicode)

I myself am not a big fan of headlines like "Pokemon in their own juice for dummies/pots/pans", but this seems to be exactly such a case: we will talk about basic things which, handled carelessly, quite often lead to a pile of painful bumps and a lot of time wasted around the question "Why doesn't it work?". If you are still afraid of, or do not understand, Unicode - please read on.

What for?

This is the main question for a beginner faced with an impressive number of encodings and seemingly confusing mechanisms for working with them (for example, in Python 2.x). The short answer: because that is how things historically turned out :)

An encoding, for those who don't know, is a way of representing digits, letters and all other characters in computer memory (that is, as zeros and ones / numbers). For example, the space character is represented as 0b100000 (binary), 32 (decimal) or 0x20 (hexadecimal).
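
The same space-character example, checked in the interpreter (it works the same way in Python 2 and 3):

>>> ord(' ')          # the code of the space character
32
>>> hex(ord(' ')), bin(ord(' '))
('0x20', '0b100000')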

So, once upon a time there was very little memory, and 7 bits were enough for all computers to represent all the necessary characters (digits, lowercase/uppercase Latin letters, a bunch of symbols and the so-called control characters - all 127 available numbers had been handed out to someone). At that time there was only one encoding - ASCII. As time went on, everyone was happy, and whoever was not happy (read: whoever lacked some symbol or a letter of their native alphabet) used the remaining 128 characters at their own discretion, that is, created new encodings. This is how ISO-8859-1 and our (that is, Cyrillic) cp1251 and KOI8 appeared. Along with them came the problem of interpreting bytes of the form 0b1xxxxxxx (that is, characters/numbers from 128 to 255): for example, 0b11011111 in the cp1251 encoding is our own "Я", while in the ISO-8859-1 encoding it is the German Eszett "ß". As expected, network communication and simply exchanging files between different computers turned into hell-knows-what, even though headers like "Content-Encoding" in the HTTP protocol, in e-mail and on HTML pages saved the day a bit.
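
A one-byte illustration of that ambiguity (a Python 2 style sketch; 0xdf is the same byte mentioned above, and the console is assumed to be able to display both characters):

>>> b = '\xdf'                 # 0b11011111, a single byte
>>> print b.decode('cp1251')   # interpreted as Windows-1251
Я
>>> print b.decode('latin-1')  # interpreted as ISO-8859-1
ß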

At this point, bright minds got together and proposed a new standard - Unicode. It is a standard, not an encoding: Unicode itself does not determine how characters are stored on disk or transmitted over the network. It only defines the correspondence between a character and a certain number, while the format in which these numbers are turned into bytes is defined by the Unicode encodings (for example, UTF-8 or UTF-16). At the moment there are a little over 100 thousand characters in the Unicode standard, while UTF-16 can represent over a million (and UTF-8 even more).

For more detail (and more fun) on the topic, I advise you to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.

Get to the point!

Naturally, Python supports Unicode. But, unfortunately, it was only in Python 3 that all strings became Unicode by default, so beginners keep banging their heads against errors like:

>>> with open ("1.txt") as fh: s = fh.read () >>> print s koshchey >>> parser_result = u "baba-yaga" # assignment for clarity, let's imagine that this is the result some parser works >>> ", line 1, in parser_result + s UnicodeDecodeError: "ascii" codec can "t decode byte 0xea in position 0: ordinal not in range (128)
or like this:
>>> str(parser_result)
Traceback (most recent call last):
  File "...", line 1, in <module>
    str(parser_result)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
Let's figure it out, but in order.

Why would anyone use Unicode?
Why does my favorite HTML parser return Unicode? Let it return an ordinary string, and I'll deal with it from there! Right? Not really. Although each of the characters existing in Unicode can (probably) be represented in some single-byte encoding (ISO-8859-1, cp1251 and the others are called single-byte because they encode any character in exactly one byte), what do you do if the string needs to contain characters from different encodings? Assign a separate encoding to each character? No, of course you have to use Unicode.
Why do we need a new type, "unicode"?
So we got to the most interesting part. What is a string in Python 2.x? It is simply bytes - just binary data that can be anything. In fact, when we write something like:

>>> x = "abcd"
>>> x
'abcd'

the interpreter does not create a variable containing the first four letters of the Latin alphabet, but only a sequence of four bytes ('a', 'b', 'c', 'd'), and the Latin letters here are used purely to denote those byte values. So "a" here is just a synonym for "\x61" and nothing more. For example:

>>> "\ x61" "a" >>> struct.unpack ("> 4b", x) # "x" is just four signed / unsigned chars (97, 98, 99, 100) >>> struct.unpack ("> 2h", x) # or two short (24930, 25444) >>> struct.unpack ("> l", x) # or one long (1633837924,) >>> struct.unpack ("> f", x) # or float (2.6100787562286154e + 20,) >>> struct.unpack ("> d", x * 2) # or half a double (1.2926117739473244e + 161,)
And that's it!

And the answer to the question of why we need "unicode" is now more obvious: we need a type that is represented by characters, not by bytes.

Okay, I figured out what the string is. Then what is Unicode in Python?
"Type unicode" is primarily an abstraction that implements the idea of ​​Unicode (a set of characters and associated numbers). An object of the "unicode" type is no longer a sequence of bytes, but a sequence of actual characters without any idea of ​​how these characters can be effectively stored in computer memory. If you prefer, this is a higher level of abstraction than byte strings (this is what Python 3 calls regular strings that are used in Python 2.6).
How do I use Unicode?
A Unicode string in Python 2.6 can be created in (at least) three natural ways:
  • u "" literal: >>> u "abc" u "abc"
  • Method "decode" for byte string: >>> "abc" .decode ("ascii") u "abc"
  • "Unicode" function: >>> unicode ("abc", "ascii") u "abc"
ascii in the last two examples is specified as the encoding that will be used to convert bytes to characters. The stages of this transformation look something like this:

"\ x61" -> ascii encoding-> latin lowercase "a" -> u "\ u0061" (unicode-point for this letter) or "\ xe0" -> c1251 encoding -> cyrillic lowercase "a" -> u "\ u0430"

How to get a regular string from a unicode string? Encode it:

>>> u "abc" .encode ("ascii") "abc"

The coding algorithm is naturally the opposite of the one given above.

Remember and don't get confused: unicode == characters, str == bytes; turning bytes into something meaningful (characters) is decode, and turning characters into bytes is encode.

It won't encode :(
Let's look at the examples from the beginning of the article. How does concatenation of a byte string and a unicode string work? The plain string must be converted to a unicode string, and since the interpreter does not know the encoding, it uses the default one - ascii. If this encoding fails to decode the string, we get an ugly error. In that case we need to convert the string to unicode ourselves, using the correct encoding:

>>> print type(parser_result), parser_result
<type 'unicode'> баба-яга
>>> s = "Кощей"
>>> parser_result + s
Traceback (most recent call last):
  File "...", line 1, in <module>
    parser_result + s
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)
>>> parser_result + s.decode("cp1251")
u'\xe1\xe0\xe1\xe0-\xff\xe3\xe0\u043a\u043e\u0449\u0435\u0439'
>>> print parser_result + s.decode("cp1251")
баба-ягакощей
>>> print " & ".join((parser_result, s.decode("cp1251")))  # It's better this way :)
баба-яга & кощей

A "UnicodeDecodeError" is usually an indication to decode the string to Unicode using the correct encoding.

Now, about using "str" with unicode strings. Don't use "str" with unicode strings :) "str" gives no way to specify the encoding, so the default encoding will always be used, and any characters above 128 will lead to an error. Use the "encode" method instead:

>>> print type(s), s
<type 'unicode'> Кощей
>>> str(s)
Traceback (most recent call last):
  File "...", line 1, in <module>
    str(s)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
>>> s = s.encode("cp1251")
>>> print type(s), s
<type 'str'> Кощей

"UnicodeEncodeError" is a sign that we need to specify the correct encoding when converting a unicode string to a regular one (or use the second parameter "ignore" \ "replace" \ "xmlcharrefreplace" in the "encode" method).

I want more!
Okay, let's use Baba Yaga from the example above again:

>>> parser_result = u "baba-yaga" # 1 >>> parser_result u "\ xe1 \ xe0 \ xe1 \ xe0- \ xff \ xe3 \ xe0" # 2 >>> print parser_result áàáà-ÿãà # 3 >>> print parser_result.encode ("latin1") # 4 baba yaga >>> print parser_result.encode ("latin1"). decode ("cp1251") # 5 baba yaga >>> print unicode ("baba yaga", "cp1251") # 6 baba-yaga
The example is not entirely simple, but there is everything (well, or almost everything). What's going on here:

  1. What do we have at the input? Bytes that IDLE passes to the interpreter. What do we need at the output? Unicode, that is, characters. All that remains is to turn the bytes into characters - but we need an encoding for that, right? Which encoding will be used? Let's look further.
  2. Here is an important point:

    >>> "баба-яга"
    '\xe1\xe0\xe1\xe0-\xff\xe3\xe0'
    >>> u"\u00e1\u00e0\u00e1\u00e0-\u00ff\u00e3\u00e0" == u"\xe1\xe0\xe1\xe0-\xff\xe3\xe0"
    True

    As you can see, Python does not bother choosing an encoding - the bytes are simply turned into Unicode code points:

    >>> ord("а")     # the byte 0xe0, Cyrillic "а" in cp1251
    224
    >>> ord(u"а")    # the same value, blindly promoted to a code point
    224
  3. And here is the problem: character 224 in cp1251 (the encoding used by the interpreter console) is not at all the same as character 224 in Unicode. It is because of this that we get garbage when trying to print our unicode string.
  4. So how do we help Baba Yaga? It turns out that the first 256 Unicode characters coincide with the ISO-8859-1 / latin1 encoding, so if we use it to encode our unicode string, we get back exactly the bytes we typed in (the curious can look in Objects/unicodeobject.c for the definition of the function "unicode_encode_ucs1"):

    >>> parser_result.encode("latin1")
    '\xe1\xe0\xe1\xe0-\xff\xe3\xe0'
  5. And how do we get Baba Yaga in proper Unicode? We have to say which encoding to use:

    >>> parser_result.encode("latin1").decode("cp1251")
    u'\u0431\u0430\u0431\u0430-\u044f\u0433\u0430'
  6. The approach from point 5 is certainly not great; it is much more convenient to use the built-in unicode() with the right encoding, as in line 6 of the example.
Actually, things are not that bad with u"" literals, since the problem only shows up in the console. If non-ASCII characters are used in a source file, Python will insist on a header like "# -*- coding: -*-" (PEP 0263), and unicode literals will then use the correct encoding.
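
A minimal sketch of such a source file (the declared encoding must match how the file is actually saved; utf-8 here is just an example):

# -*- coding: utf-8 -*-
s = u"баба-яга"   # with the declared encoding this literal is decoded correctly
print s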

There is also a way to use u"" to represent, for example, Cyrillic without specifying the encoding and without unreadable Unicode code points (that is, without writing u"\u1234"). The method is not entirely convenient, but it is interesting: using Unicode character names:

>>> s = u "\ N (CYRILLIC SMALL LETTER KA) \ N (CYRILLIC SMALL LETTER O) \ N (CYRILLIC SMALL LETTER SHCHA) \ N (CYRILLIC SMALL LETTER IE) \ N (CYRILLIC SMALL LETTER SHORT I)"> >> print s koshchey

Well, that's about it. The main advice: don't confuse "encode" with "decode", and understand the difference between bytes and characters.

Python 3
No code here, because I have no experience with it yet. Witnesses say that everything there is much simpler and more fun. Whoever undertakes to demonstrate the differences between Python 2.x and Python 3.x on small, simple examples - respect to them.

Useful

Since we are talking about encodings, I will recommend a resource that from time to time helps to overcome krakozyabry (mojibake): http://2cyr.com/decode/?lang=ru.

Tags:

  • python
  • unicode
  • encoding