
Rle coding examples. Non-existence of an ideal algorithm

Using the RLE algorithm, encode the character sequence BBBBBBACCCABBBBBB

Write the result as hexadecimal codes (each character is encoded as a byte, which is represented by two hexadecimal digits). Check the result using the RLE program.

Decode the sequence packed using the RLE algorithm (hexadecimal codes are given): 01 4D 8E 41 01 4D 8E 41₁₆. Use the ASCII table to determine characters by their hexadecimal codes. For example, the character "&" has the hexadecimal code 26₁₆.

  • Determine the number of bytes in the original and decompressed sequences and calculate the compression ratio.
  • Check the result obtained in the previous paragraph using the RLE program. Suggest two ways to check.
  • Construct sequences that are compressed by the RLE algorithm exactly 2 times, 4 times and 5 times. Check your answers with the RLE program.
  • Think of three sequences that cannot be compressed using the RLE algorithm.
  • Using the RLE program, apply RLE compression to the following files and find the compression ratio for each of them.
  • Explain the results obtained in the previous paragraph:
    Why can't pictures in JPEG format be compressed? Why are the compression ratios under the RLE algorithm so different for two BMP drawings of the same size? Hint: open these drawings in any viewer.
Estimate the maximum achievable compression ratio using the textbook variant of the RLE algorithm. In what case can it be achieved?
Estimate the compression ratio for the RLE algorithm in the worst case. Describe this worst case.


A long time ago, when I was still a naive schoolboy, I suddenly became terribly curious: how does the data in the archives magically take up less space? Riding my faithful dialup, I began to surf the Internet in search of an answer, and found many articles with a fairly detailed presentation of the information I was interested in. But none of them seemed easy to understand then - the code listings seemed like Chinese letters, and attempts to understand unusual terminology and various formulas were unsuccessful.

Therefore, the purpose of this article is to give an idea of the simplest compression algorithms to those whose knowledge and experience do not yet allow them to immediately understand more professional literature, or whose profile is far from such topics. That is, I will explain some of the simplest algorithms "on my fingers" and give examples of their implementation without kilometer-long code listings.

I will immediately warn you that I will not consider the details of the implementation of the encoding process and such nuances as the efficient search for occurrences of a string. The article will only touch upon the algorithms themselves and ways of presenting the result of their work.

RLE - Compactness of Uniformity

RLE algorithm is probably the simplest of all: its essence lies in the coding of repetitions. In other words, we take sequences of identical elements, and “collapse” them into count/value pairs. For example, a string like "AAAAAAAABCCCC" could be converted to an entry like "8xA, B, 4xC". This is, in general, everything you need to know about the algorithm.

Implementation example

Suppose we have a set of some integer coefficients that can take values from 0 to 255. Logically, we came to the conclusion that it is reasonable to store this set as an array of bytes:
unsigned char data[] = { 0, 0, 0, 0, 0, 0, 4, 2, 0, 4, 4, 4, 4, 4, 4, 4, 80, 80, 80, 80, 0, 2, 2, 2, 2, 255, 255, 255, 255, 255, 0, 0 };

For many, it will be much more familiar to see this data as a hex dump:
0000: 00 00 00 00 00 00 04 02 00 04 04 04 04 04 04 04
0010: 50 50 50 50 00 02 02 02 02 FF FF FF FF FF 00 00

On reflection, we decided that it would be good to somehow compress such sets to save space. To do this, we analyzed them and identified a pattern: very often there are subsequences consisting of the same elements. Of course, RLE is perfect for this!

Let's encode our data using the newly obtained knowledge: 6x0, 4, 2, 0, 7x4, 4x80, 0, 4x2, 5x255, 2x0.

It's time to somehow present our result in a computer-friendly way. To do this, in the data stream we must somehow separate single bytes from encoded chains. Since the entire range of byte values is used by our data, it will not be possible to simply set aside some range of values for our own purposes.

There are at least two ways out of this situation:

  1. As an indicator of a compressed chain, select one byte value, and in case of a collision with real data, escape them. For example, if we use the value 255 for “official” purposes, then when this value is encountered in the input data, we will have to write “255, 255” and use a maximum of 254 after the indicator.
  2. Structure the encoded data, indicating the number of not only repeated, but also subsequent single elements. Then we will know in advance where what data is.
The first method in our case does not seem effective, so, perhaps, we will resort to the second.

So now we have two kinds of sequences: strings of single elements (like "4, 2, 0") and strings of identical elements (like "0, 0, 0, 0, 0, 0"). Let's allocate one bit in the "service" bytes for the sequence type: 0 - single elements, 1 - identical. Take for this, say, the most significant bit of a byte.

In the remaining 7 bits, we will store the lengths of the sequences, i.e. the maximum length of the encoded sequence is 127 bytes. We could allocate, say, two bytes for service needs, but in our case such long sequences are extremely rare, so it is easier and more economical to simply break them into shorter ones.

It turns out that in the output stream we will first write the length of the sequence, and then either one repeated value, or a chain of non-repeating elements of the specified length.

The first thing that should catch your eye is that in this situation we have a couple of unused values. There cannot be zero-length sequences, so we can increase maximum length up to 128 bytes, subtracting one from the length when encoding and adding when decoding. So we can encode lengths from 1 to 128 instead of lengths from 0 to 127.

The second thing to note is that there are no sequences of identical elements of unit length. Therefore, when encoding, we will subtract one more from the length of such sequences, thereby increasing their maximum length to 129 (the maximum length of a chain of single elements is still 128). Those. chains of identical elements can have a length from 2 to 129.

Let's encode our data again, but now in a form understandable to the computer. We will write the service bytes as <T, L>, where T is the sequence type and L is the length. We will immediately take into account that we write the lengths in a modified form: at T=0 we subtract one from L, at T=1 we subtract two.

<1,4>, 0, <0,2>, 4, 2, 0, <1,5>, 4, <1,2>, 80, <0,0>, 0, <1,2>, 2, <1,3>, 255, <1,0>, 0

Let's try to decode our result:

  • <1,4>. T=1, so the next byte is repeated L+2 (4+2) times: 0, 0, 0, 0, 0, 0.
  • <0,2>. T=0, so we just read L+1 (2+1) bytes: 4, 2, 0.
  • <1,5>. T=1, repeat the next byte 5+2 times: 4, 4, 4, 4, 4, 4, 4.
  • <1,2>. T=1, repeat the next byte 2+2 times: 80, 80, 80, 80.
  • <0,0>. T=0, read 0+1 bytes: 0.
  • <1,2>. T=1, repeat the byte 2+2 times: 2, 2, 2, 2.
  • <1,3>. T=1, repeat the byte 3+2 times: 255, 255, 255, 255, 255.
  • <1,0>. T=1, repeat the byte 0+2 times: 0, 0.

And now the last step: store the result as an array of bytes. For example, the pair <1,4> packed into a byte is the type bit 1 followed by the seven length bits 0000100, i.e. 1000 0100 in binary, or 84 in hexadecimal.

As a result, we get the following:
0000: 84 00 02 04 02 00 85 04 82 50 00 00 82 02 83 FF
0010: 80 00

In such a simple way, on this example input we turned 32 bytes of data into 18. Not a bad result, especially considering that it can be much better on longer chains.
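To make the scheme concrete, here is a minimal sketch of an encoder for exactly this format (the function name and buffer size are arbitrary); on the array from the beginning of the example it should print the same 18 bytes obtained above.

#include <stdio.h>

/* Minimal sketch of an encoder for the format described above.
   A service byte keeps the chain type T in the most significant bit
   (0 - single elements, 1 - identical elements) and the adjusted length
   in the lower 7 bits (length minus 1 for literal chains, minus 2 for
   chains of identical elements). */
size_t rle_encode(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        /* measure the run of identical bytes starting at position i */
        size_t run = 1;
        while (i + run < n && in[i + run] == in[i] && run < 129)
            run++;
        if (run >= 2) {
            out[o++] = 0x80 | (unsigned char)(run - 2);   /* T=1 */
            out[o++] = in[i];
            i += run;
        } else {
            /* collect single elements until the next run of identical bytes */
            size_t lit = 0;
            while (i + lit < n && lit < 128 &&
                   !(i + lit + 1 < n && in[i + lit + 1] == in[i + lit]))
                lit++;
            out[o++] = (unsigned char)(lit - 1);          /* T=0 */
            for (size_t k = 0; k < lit; k++)
                out[o++] = in[i + k];
            i += lit;
        }
    }
    return o;
}

int main(void)
{
    unsigned char data[] = { 0, 0, 0, 0, 0, 0, 4, 2, 0, 4, 4, 4, 4, 4, 4, 4,
                             80, 80, 80, 80, 0, 2, 2, 2, 2,
                             255, 255, 255, 255, 255, 0, 0 };
    unsigned char out[64];
    size_t len = rle_encode(data, sizeof data, out);
    for (size_t i = 0; i < len; i++)
        printf("%02X ", out[i]);
    printf("\n(%zu bytes)\n", len);    /* expect 18 bytes, as in the dump above */
    return 0;
}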

Possible improvements

The efficiency of an algorithm depends not only on the algorithm itself, but also on how it is implemented. Therefore, for different data it is possible to develop different variations of encoding and of presenting the encoded data. For example, when encoding images, you can make chains of variable length: allocate one bit to indicate a long chain, and if it is set to one, store the length in the next byte as well. This way we sacrifice the maximum length of short chains (65 elements instead of 129), but in return we gain the possibility of encoding chains of up to 16385 elements (2¹⁴ + 1) in just three bytes!
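As a sketch of this improvement (the choice of the second-highest bit as the "long chain" flag is an assumption made here, not something fixed above):

#include <stdio.h>

/* Hypothetical header writer for variable-length chains: the top bit is the
   chain type, the next bit flags a long chain, the rest holds the adjusted
   length (6 bits for short chains, 6 + 8 = 14 bits for long ones). */
size_t put_header(unsigned char *out, int identical, unsigned len)
{
    unsigned v = len - (identical ? 2 : 1);      /* same length adjustment as before */
    unsigned char t = identical ? 0x80 : 0x00;
    if (v < 64) {                                /* short chain: up to 65 identical elements */
        out[0] = t | (unsigned char)v;
        return 1;
    }
    out[0] = t | 0x40 | (unsigned char)(v >> 8); /* long chain: up to 16385 elements */
    out[1] = (unsigned char)(v & 0xFF);
    return 2;
}

int main(void)
{
    unsigned char h[2] = { 0, 0 };
    size_t n = put_header(h, 1, 1000);           /* a run of 1000 identical bytes */
    printf("%zu header byte(s): %02X %02X\n", n, h[0], h[1]);
    return 0;
}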

Additional efficiency can be achieved by using heuristic coding techniques. For example, let's encode the following chain in our way: "ABBA". We get "<0,0>, A, <1,0>, B, <0,0>, A" - that is, we turned 4 bytes into 6, inflating the original data by one and a half times! And the more such short alternating heterogeneous sequences there are, the more redundant data. If we take this into account, we could encode the result as "<0,3>, A, B, B, A" - spending only one extra byte.

LZ77 - brevity in repetition

LZ77 is one of the simplest and most well-known algorithms in the LZ family. It is named after its creators: Abraham Lempel and Jacob Ziv. The number 77 in the name refers to 1977, the year in which the article describing this algorithm was published.

The main idea is to encode identical sequences of elements. That is, if some chain of elements occurs more than once in the input data, then all its subsequent occurrences can be replaced by “links” to its first instance.

Like the other algorithms of this family, LZ77 uses a dictionary that stores previously encountered sequences. To do this, it applies the principle of the so-called "sliding window": the area immediately preceding the current encoding position, within which we can address links. This window is the dynamic dictionary of the algorithm: each element in it corresponds to two attributes, a position in the window and a length. Physically, it is just a piece of memory that we have already encoded.

Implementation example

Let's try to code something now. Let's generate some suitable string for this (I apologize in advance for its absurdity):

"The compression and the decompression leave an impression. Hahahahaha!"

Here is how it will look in memory (ANSI encoding):
0000: 54 68 65 20 63 6F 6D 70 72 65 73 73 69 6F 6E 20 The compression
0010: 61 6E 64 20 74 68 65 20 64 65 63 6F 6D 70 72 65 and the decompre
0020: 73 73 69 6F 6E 20 6C 65 61 76 65 20 61 6E 20 69 ssion leave an i
0030: 6D 70 72 65 73 73 69 6F 6E 2E 20 48 61 68 61 68 mpression. Hahah
0040: 61 68 61 68 61 21 ahaha!

We have not yet decided on the size of the window, but we will agree that it is larger than the size of the string being encoded. Let's try to find all repeating chains of characters. A chain is a sequence of characters longer than one. If a chain is part of a longer repeating chain, we will ignore it.

“The compression and t[he ]de[compression ]leave[ an] i[mpression]. Hah[ahahaha]!”

For greater clarity, let's look at the diagram, where we can see the correspondence of repeated sequences and their first occurrences:

Perhaps the only unclear point here is the sequence “Hahahahaha!”, because the string of characters “ahahaha” corresponds to the short string “ah”. But there is nothing unusual here, we used some trick that allows the algorithm to sometimes work like the RLE described earlier.

The fact is that when unpacking, we will read the specified number of characters from the dictionary. And since the whole sequence is periodic, i.e. the data in it is repeated with a certain period, and the symbols of the first period will be right before the unpacking position, then we can recreate the entire chain from them, simply copying the symbols of the previous period into the next one.
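As a small sketch of this (the buffer contents are simply the state right before the last reference of our example is unpacked), the copy has to be done strictly byte by byte, so that bytes written earlier immediately become a source for the later ones:

#include <stdio.h>

/* Unpacking a reference whose length exceeds its back offset: the short
   period right before the current position is enough to rebuild the chain. */
int main(void)
{
    char buf[32] = "Hah";                 /* already unpacked part of the stream */
    size_t pos = 3;                       /* current unpacking position */
    size_t offset = 2, length = 7;        /* the reference <2,7> from this example */

    for (size_t k = 0; k < length; k++)   /* copy one byte at a time */
        buf[pos + k] = buf[pos + k - offset];
    buf[pos + length] = '\0';

    printf("%s\n", buf);                  /* prints "Hahahahaha" */
    return 0;
}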

Got it sorted out. Now let's replace the found repetitions with references to the dictionary. We will write a link in the format <P, L>, where P is the position of the first occurrence of the chain in the string and L is its length.

“The compression and t<1,3>de<4,12>leave<15,3> i<6,9>. Hah<60,7>!”

But do not forget that we are dealing with a sliding window. For better understanding, so that the links do not depend on the size of the window, we will replace the absolute positions with the difference between them and the current encoding position.

“The compression and t<20,3>de<22,12>leave<28,3> i<42,9>. Hah<2,7>!”

Now we just need to subtract P from the current encoding position to get the absolute position in the string.

It's time to decide on the window size and the maximum length of the encoded phrase. Since we are dealing with text, it is rare that there will be particularly long repeating sequences in it. So let's allocate, say, 4 bits for their length - a limit of 15 characters at a time is quite enough for us.

But the size of the window depends on how deep we will look for identical chains. Since we are dealing with small texts, it will be just right to round the size of a link up to two bytes: we will address links in a range of 4096 bytes, using 12 bits for the position.

We know from experience with RLE that not all values ​​can be used. Obviously, the link can have a minimum value of 1, therefore, in order to address back in the range 1..4096, we must subtract one from the link when encoding, and add back when decoding. The same with the lengths of sequences: instead of 0..15 we will use the range 2..17, since we do not work with zero lengths, and individual characters are not sequences.

So, let's imagine our encoded text, taking into account these amendments:

“The compression and t<19,1>de<21,10>leave<27,1> i<41,7>. Hah<1,5>!”

Now, again, we need to somehow separate the compressed chains from the rest of the data. The most common way is to use the structure again and directly indicate where the data is compressed and where it is not. To do this, we will divide the encoded data into groups of eight elements (characters or links), and before each of these groups we will insert a byte, where each bit corresponds to the element type: 0 for a character and 1 for a link.

We divide into groups:

  • "The Comp"
  • "remission"
  • "and t de"
  • "leave"
  • "i. Hah"
Compiling groups:

"(0,0,0,0,0,0,0,0) The comp(0,0,0,0,0,0,0,0) ression (0,0,0,0,0,1 ,0,0) and t de(1,0,0,0,0,0,1,0) leave (0,1,0,0,0,0,0,1) i . Hah(0)!"

Thus, if we encounter bit 0 during unpacking, then we simply read the character into the output stream, if bit is 1, we read the link, and by reference we read the sequence from the dictionary.

Now we just have to group the result into an array of bytes. Let's agree that we use bits and bytes in order from high to low. Let's see how links are packed into bytes using an example:
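The original illustration for this step is not reproduced here, so here is a small sketch of the packing instead, assuming exactly the layout described above: the 12-bit position (stored minus one) in the upper bits and the 4-bit length (stored minus two) in the lower ones, high byte first.

#include <stdio.h>

/* Packing one link <P, L> into two bytes. */
int main(void)
{
    unsigned p = 20, l = 3;                 /* the first link of our text: <20,3> */
    unsigned v = ((p - 1) << 4) | (l - 2);  /* 16-bit value, here 0x0131 */
    unsigned char hi = (v >> 8) & 0xFF;
    unsigned char lo = v & 0xFF;
    printf("%02X %02X\n", hi, lo);          /* prints "01 31" */
    return 0;
}

The bytes 01 31 are exactly the pair that encodes the first link in the stream below.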

As a result, our compressed stream will look like this:

0000: 00 54 68 65 20 63 6f 6d 70 00 72 65 73 73 69 6f #The comp#ressio
0010: 6e 20 04 61 6e 64 20 74 01 31 64 65 82 01 5a 6c n #and t##de###l
0020: 65 61 76 65 01 b1 20 41 69 02 97 2e 20 48 61 68 eave## #i##. Hah
0030: 00 15 00 21 00 00 00 00 00 00 00 00 00 00 00 00 ###!

Possible improvements

In principle, everything that was described for RLE will be true here. In particular, to demonstrate the usefulness of the heuristic approach in coding, consider the following example:

"The long goooooong. The loooooower bound."

Let's find repeated chains only for the word "loooooower". The chain "lo" occurs earlier in "long", and a chain of "o"s occurs in "goooooong", so both can be replaced by links:

"The long goooooong. The <link><link>wer bound."

To encode such a result, we would need four bytes for the links (two links of two bytes each). However, it would be more economical to do this:

"The long goooooong. The l<link>wer bound."

Here the "l" is left as a literal character and a single link covers the whole run of "o"s, so we would spend one byte less.

Instead of a conclusion

Despite their simplicity and, it would seem, not too great efficiency, these algorithms are still widely used in various areas of the IT field.

Their advantage is simplicity and speed, and more complex and efficient algorithms can be based on their principles and their combinations.

I hope that the essence of these algorithms stated in this way will help someone understand the basics and start looking towards more serious things.

Even 10-15 years ago, archivers were used mainly to save space on hard drives and to fit as much data on a floppy disk. However, times have changed. No one has been using floppy disks as a means of transferring and storing information for a long time, and the cost of drives has become so low that no one even thinks about compressing data in order to save space. In addition, data volumes have become so large that archiving and unarchiving them to save space is simply not practical, because it takes a lot of time. Well, indeed, today the amount of data on users' drives is measured in terabytes. Now imagine how long it would take to archive one terabyte of data. This will take not one or even two hours, but at least 10-12 hours, that is, the computer will have to be left on all night.

It would seem that archivers today should gradually lose their relevance. But nothing like that happens. The vast majority of users, among other programs, have archivers installed, or they use an archiver built into the Windows operating system (we do not consider other operating systems in this publication).

The fact is that the purpose of archivers has changed. Today they are used primarily for publishing data on the Web. Most drivers on manufacturers' websites are posted as archives, and most of the programs on various resources are also archived. Incidentally, users themselves usually pack data into an archive before uploading it to the network (for example, to file-sharing resources).

As for the Russian market, the three most common archivers are WinRAR, WinZip and 7-Zip, each available in both 32-bit and 64-bit versions. It is these that we will compare in this article. First, however, let's take a quick look at some theoretical aspects of how archivers work.

Information compression algorithms

The essence of any information compression algorithm is to obtain, by some transformation of the initial set of information bits, a smaller set at the output. All data transformation algorithms can be conditionally divided into two classes: reversible and irreversible.

An irreversible information compression algorithm means such a transformation of the original bit sequence, in which the output sequence of a smaller size does not allow the input sequence to be restored exactly. Irreversible compression algorithms are used, for example, in relation to graphic, video and audio data, and this always leads to a loss of their quality. Naturally, algorithms for irreversible data compression are not used in archivers, and therefore we will not consider them further.

Reversible data compression algorithms allow you to exactly restore the original data sequence from the compressed sequence. It is these algorithms that are used in archivers. General characteristics of all compression algorithms are the compression ratio (the ratio of the volumes of the original and compressed data sequence), the compression rate (the time spent archiving a certain amount of data) and the compression quality (a value showing how much the data sequence is compressed by applying re-compression to it according to this or some other algorithm).

The theory of information compression, also known as the theory of economical coding of discrete information, is a rather complex branch of mathematical science. Of course, presenting even its basics is far beyond the scope of this article, so we will only superficially consider various information compression algorithms without going into details. In addition, a great many data compression algorithms have been developed to date, and considering all of them is likewise not our task. Therefore, we will only talk at a qualitative level about several algorithms that give a general idea of the methods of information compression.

RLE algorithm

One of the oldest and simplest information compression methods is the RLE (Run Length Encoding) algorithm, that is, run-length encoding. This method is very simple to implement and is used in a number of archiving tools; its essence is to replace a series (group) of repeated bytes with one encoding byte plus a counter of the number of repetitions. That is, a group of identical bytes is replaced by the pair <repeat count, value>, which reduces data redundancy.

In this algorithm, the sign of a counter is a pair of ones in the top two bits of the byte being read. For example, if the first two bits are 11, then the remaining 6 bits are allocated to the counter, which can take values from 1 to 64. Accordingly, a series of 64 repeated bytes can be described with just two bytes, that is, compressed 32 times.
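A hypothetical decoding sketch of this variant (the assumption that the 6-bit field stores the repeat count minus one, so that it covers the range 1..64, is ours):

#include <stdio.h>

/* If the two most significant bits of a byte are 11, its lower 6 bits hold
   a repeat counter for the next byte; otherwise the byte is copied as-is. */
size_t rle_unpack(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        if ((in[i] & 0xC0) == 0xC0) {        /* flag bits 11 */
            int count = (in[i] & 0x3F) + 1;  /* 6-bit counter, 1..64 */
            for (int k = 0; k < count; k++)
                out[o++] = in[i + 1];
            i += 2;
        } else {
            out[o++] = in[i++];              /* ordinary byte */
        }
    }
    return o;
}

int main(void)
{
    /* 0xFF = flag 11 + counter 63: repeat 'A' 64 times, then the literals "BC" */
    const unsigned char packed[] = { 0xFF, 'A', 'B', 'C' };
    unsigned char out[128];
    printf("%zu bytes unpacked\n", rle_unpack(packed, sizeof packed, out)); /* 66 */
    return 0;
}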

There is another implementation of this algorithm, in which the sign of the counter is a 1 in the most significant bit of the counter byte. In this case the counter can take a maximum value of 127, and therefore the maximum compression ratio is 64.

It is clear that the RLE algorithm is only efficient when there are a large number of long groups of identical bytes. If the bytes are not repeated, then using the RLE method leads to an increase in the file size.

The RLE method is generally very efficient for compressing bitmap graphics (BMP, PCX, TIF, GIF) because they contain very long series of repeating byte sequences.

Information Alphabet Restriction

Another fairly simple way to compress information is restricting the information alphabet. We note right away that in practice such an algorithm is not used, but describing it will help to better understand the method of variable-length codes.

In what follows, by the informational alphabet we will mean the set of symbols used to encode the informational sequence. For example, suppose there is some text message. Each letter of this message is encoded using an ASCII table consisting of 256 characters. In this case, exactly 8 bits (1 byte) are allocated for encoding each character. In this case, the information alphabet is all 256 characters of the ASCII encoding table.

It is clear that not all 256 characters of the ASCII table may be used in the original text message. For example, if we are talking about a text message in Russian that does not contain digits, 64 characters are sufficient (33 lowercase and 31 capital letters). If we add punctuation marks, paragraph and newline marks, it becomes clear that the number of characters will not exceed 100. In this case, we can use not 8-bit but 7-bit character encoding, which gives a table of 128 characters. That is, we, as it were, limit the information alphabet, which lets us reduce the bit depth of each encoded character. You can go further and determine exactly how many characters are used in the text message. If, for example, it turns out that only 30 characters are used in the message (including newlines), then a 5-bit encoding table containing 32 characters can be used, and the compression ratio of the text message becomes even greater. Indeed, if 8-bit character encoding is used in the original message and 5-bit encoding in the compressed one, then the compression ratio will be 8/5.
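A tiny sketch of this arithmetic (the character counts are simply the figures mentioned above):

#include <stdio.h>
#include <math.h>

/* The fewer distinct characters a message uses, the fewer bits are needed
   per character; the ratio is taken against the usual 8-bit encoding.
   Compile with -lm. */
int main(void)
{
    int used[] = { 128, 100, 32, 30 };           /* distinct characters in the message */
    for (int i = 0; i < 4; i++) {
        int bits = (int)ceil(log2(used[i]));     /* smallest code width that fits */
        printf("%3d characters -> %d bits per character, ratio %.2f\n",
               used[i], bits, 8.0 / bits);
    }
    return 0;
}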

Despite the apparent simplicity of the method of limiting the alphabet, in practice it is not used. The fact is that the described method does not allow correctly decoding the original message without transmitting additional information. Indeed, without knowing the encoding tables used to compress information, it is impossible to decode it. That is, along with the encoded information sequence, it is necessary to transmit encoding tables. It is clear that in this case the gain from using a limited alphabet is reduced to zero.

The restricted alphabet method has other disadvantages as well. If the original information message contains a large number of various characters, then it will not be possible to reduce the bit depth of the representation of alphabetic characters, and this method simply will not work. Suppose, for example, that the original information message contains 129 characters from a 256-character alphabet. It will not be possible to use 7-bit character encoding in this case, since 7 bits will only encode 128 characters. Therefore, for 129 characters, you will have to use the same 8-bit encoding as in the original 256-character alphabet.

Variable length codes

One of the main disadvantages of the hypothetical alphabet-restriction method we have considered is that it uses a uniform code, in which all characters of the information alphabet have codes of the same length (8, 7 or fewer bits). It would be more logical to use a coding system in which the length of a character's code depends on the frequency of its occurrence in the information message. That is, if in the original informational message some characters occur more often than others, then it is optimal to use short codes for them and longer codes for the rare ones.

As a hypothetical example, consider the following informational message: the Russian word "авиакатастрофа" ("air crash"), which contains 14 characters. Suppose we have an alphabet of 14 characters that allows us to encode this message. If a uniform code is used, then 4 bits are required for each character of the alphabet (a code length of 4 bits allows for 16 characters). However, it is easy to see that in our message the symbol "a" occurs five times, the symbol "t" two times, and the remaining symbols once each. If for the symbol "a" we use a code 2 bits long, for the symbol "t" a code 3 bits long, and for the remaining characters codes 4 bits long, then we can certainly save space. It is only necessary to understand exactly how to build variable-length codes and how exactly the length of a code should depend on the frequency of the symbol in the message.

If all characters appear in the information message with the same frequency (are equiprobable), then with an information alphabet of N characters, exactly log₂N bits will be required to encode each character. In fact, this is the case of a uniform code.

If the characters have different probabilities of appearing in the information message, then, according to the theory of C. Shannon, a character whose probability is p is optimally (and, importantly, theoretically permissibly) assigned a code of length −log₂ p. Returning to our example with the word "air crash", and given that the probability of the character "a" is p(a) = 5/14, the probability of the character "t" is 2/14, and the probability of each other character is 1/14, we get: for the character "a" the optimal code length is −log₂(5/14) ≈ 1.48 bits; for the character "t" it is −log₂(2/14) ≈ 2.8 bits; and for all other characters it is −log₂(1/14) ≈ 3.8 bits. Since in our case the length of a code can only be an integer, rounding up we get that for the symbol "a" it is optimal to use a code 2 bits long, for the symbol "t" 3 bits, and for the rest 4 bits.

Let's calculate the compression ratio when using this encoding. If a uniform code based on a 14-character alphabet were used, then 56 bits would be required to encode the word "air crash". When using variable length codes, 5×2 bits + 2×3 bits + 7×4 bits = 44 bits will be required, i.e. the compression ratio will be 1.27.
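These figures are easy to check with a few lines of C (the probabilities and code lengths are the ones from the example above; compile with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* theoretically optimal code lengths -log2(p) for the 14-character word */
    printf("a: %.2f bits\n", -log2(5.0 / 14));       /* ~1.49 */
    printf("t: %.2f bits\n", -log2(2.0 / 14));       /* ~2.81 */
    printf("others: %.2f bits\n", -log2(1.0 / 14));  /* ~3.81 */

    /* total length with the rounded-up lengths of 2, 3 and 4 bits */
    int total = 5 * 2 + 2 * 3 + 7 * 4;               /* 44 bits */
    printf("variable-length total: %d bits, ratio %.2f\n",
           total, 56.0 / total);                     /* 56/44 = 1.27 */
    return 0;
}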

Now consider the algorithms for obtaining variable length codes.

Prefix encoding

The simplest method of obtaining variable-length codes is so-called prefix coding, which produces codes of integer length. The main feature of prefix codes is that within each such code system, shorter codes do not coincide with the beginning (prefix) of longer codes. It is this property of prefix codes that makes decoding very easy.

Let us explain this property of prefix codes with a specific example. Suppose we have a system of three prefix codes: (0, 10, 11). As you can see, the shorter code 0 does not coincide with the beginning of the longer codes 10 and 11. Let code 0 denote the character "a", code 10 the character "m", and code 11 the character "p". Then the word "rama" (Russian for "frame") is encoded by the sequence 110100. Let's try to decode this sequence. Since the first bit is a 1, the first character can only be "m" or "p" and is determined by the value of the second bit. Since the second bit is a 1, the first character is "p". The third bit is 0, and it uniquely corresponds to the character "a". The fourth bit is 1, so we need to look at the value of the next bit, which is 0; therefore the third character is "m". The last bit is 0, which uniquely corresponds to the character "a". Thus, the property of prefix codes that shorter codes cannot coincide with the beginning of longer codes makes it possible to unambiguously decode an information message encoded with variable-length prefix codes.
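A minimal decoder sketch for this three-code system; it prints the transliteration "pama", i.e. the Russian word "рама" ("frame") from the example:

#include <stdio.h>
#include <string.h>

/* Prefix decoding: 0 -> 'a', 10 -> 'm', 11 -> 'p'. Because no code is a
   prefix of another, the bits are consumed left to right without ambiguity. */
int main(void)
{
    const char *bits = "110100";            /* the encoded word from the text */
    size_t i = 0, n = strlen(bits);

    while (i < n) {
        if (bits[i] == '0') {               /* code 0  */
            putchar('a');
            i += 1;
        } else if (bits[i + 1] == '0') {    /* code 10 */
            putchar('m');
            i += 2;
        } else {                            /* code 11 */
            putchar('p');
            i += 2;
        }
    }
    putchar('\n');                          /* prints "pama" */
    return 0;
}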

Prefix codes are usually obtained by constructing code trees (binary ones, for a binary system). Each internal node of such a binary tree has two outgoing edges, one edge corresponding to the binary symbol "0" and the other to "1". For definiteness, we can agree that the left edge is assigned the symbol "0" and the right edge the symbol "1".

Since there can be no cycles in a tree, there can always be only one route from the root node to the leaf node. If the edges of the tree are numbered, then each such route corresponds to some unique binary sequence. The set of all such sequences will form a system of prefix codes.

For the considered example of a system of three prefix codes (0, 10, 11), which define the symbols "a", "m" and "p", the code tree is shown in Fig. 1.

Fig. 1. Code tree for the system of three prefix codes: (0, 10, 11)

The convenience of the tree representation of prefix codes lies in the fact that it is the tree structure that makes it possible to form codes of variable length that meet the main condition of prefix codes, that is, the condition that shorter codes do not coincide with the beginning of longer codes.

So far, we have considered only the idea of ​​variable length prefix codes. As for the algorithms for obtaining prefix codes, they can be developed quite a lot, but the most famous are two methods: Shannon-Fano and Huffman.

Shannon-Fano algorithm

This algorithm for obtaining prefix codes was independently proposed by R. Fano and C. Shannon; it works as follows. At the first step, all symbols of the initial information sequence are sorted in decreasing (or increasing) order of the probabilities of their occurrence (the frequencies of their repetition), after which the ordered series of symbols is divided at some point into two parts so that in each of them the sum of the symbol frequencies is approximately the same. As a demonstration, consider the already familiar word "air crash".

If the symbols that make up the word are sorted in decreasing order of their frequency of occurrence, we get the following sequence: (a(5), t(2), v(1), i(1), k(1), s(1), r(1), o(1), f(1)) (the frequency of repetition of each symbol in the word is indicated in brackets). Next, we need to divide this sequence into two parts so that in each of them the sum of the symbol frequencies is approximately the same (as far as possible). It is clear that the split falls between the symbols "t" and "v", which gives two sequences: (a(5), t(2)) and (v(1), i(1), k(1), s(1), r(1), o(1), f(1)). Moreover, the sums of the frequencies of repetition of symbols in the first and second sequences are the same and equal to 7.

At the first step of dividing a sequence of characters, we get the first digit of the code of each character. The rule here is simple: for those characters that are in the sequence on the left, the code will start with "0", and for those on the right - with "1".

In particular, the sequence (a(5), t(2)) splits into two separate characters: a(5) and t(2) (there are no other division options). Then the second digit of the code for the symbol "a" is "0", and for the symbol "t" it is "1". Since the division produced individual elements, they are not divided further; for the symbol "a" we get the code 00, and for the symbol "t" the code 01.

The sequence (v(1), i(1), k(1), s(1), r(1), o(1), f(1)) can be divided either into the sequences (v(1), i(1), k(1)) and (s(1), r(1), o(1), f(1)), or into (v(1), i(1), k(1), s(1)) and (r(1), o(1), f(1)).

In the first case, the sum of the frequencies of the symbols in the first and second sequences will be 3 and 4, respectively, and in the second case, 4 and 3, respectively. For definiteness, we use the first option.

For the characters of the resulting new sequence (v(1), i(1), k(1)) (this is the left sequence), the first two digits of the code will be 10, and for the sequence (s(1), r(1), o(1), f(1)) they will be 11.

In our example (Fig. 2 and 3), the following system of prefix codes is obtained: "a" - 00, "t" - 01, "v" - 100, "i" - 1010, "k" - 1011, "s" - 1100, "r" - 1101, "o" - 1110, "f" - 1111. As you can see, shorter codes are not the beginning of longer codes, that is, the main property of prefix codes is satisfied.
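For illustration, here is a small sketch that reproduces this table. The tie-breaking rule (when two split points are equally balanced, take the earlier one) is an assumption added to match the option chosen in the text; as noted above, both splits are acceptable.

#include <stdio.h>
#include <stdlib.h>

/* Shannon-Fano construction for the example above. The symbols are already
   sorted by decreasing frequency; every recursive split appends one more
   code digit: '0' for the left part, '1' for the right part. */
static void split(const int *freq, int lo, int hi, char codes[][10], int depth)
{
    if (lo == hi) {                          /* single symbol: code complete */
        codes[lo][depth] = '\0';
        return;
    }
    int total = 0;
    for (int i = lo; i <= hi; i++)
        total += freq[i];
    int acc = 0, best = lo, bestdiff = 1 << 30;
    for (int i = lo; i < hi; i++) {          /* pick the most balanced split point */
        acc += freq[i];
        int diff = abs(2 * acc - total);
        if (diff < bestdiff) { bestdiff = diff; best = i; }
    }
    for (int i = lo; i <= hi; i++)
        codes[i][depth] = (i <= best) ? '0' : '1';
    split(freq, lo, best, codes, depth + 1);
    split(freq, best + 1, hi, codes, depth + 1);
}

int main(void)
{
    const char sym[9] = { 'a', 't', 'v', 'i', 'k', 's', 'r', 'o', 'f' };
    const int freq[9] = { 5, 2, 1, 1, 1, 1, 1, 1, 1 };
    char codes[9][10];
    split(freq, 0, 8, codes, 0);
    for (int i = 0; i < 9; i++)
        printf("%c(%d) -> %s\n", sym[i], freq[i], codes[i]);
    return 0;
}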

Fig. 2. Demonstration of the Shannon-Fano algorithm

Fig. 3. Code tree for the word "air crash" in the Shannon-Fano algorithm

Huffman algorithm

The Huffman algorithm is another algorithm for obtaining variable-length prefix codes. Unlike the Shannon-Fano algorithm, which builds the code tree from top to bottom, this algorithm builds the code tree in reverse order, that is, from bottom to top (from the leaf nodes to the root node).

At the first stage, as in the Shannon-Fano algorithm, the initial sequence of symbols is sorted in decreasing order of the frequency of repetition of the symbols (the elements of the sequence). For the previously discussed example with the word "air crash" we get the following sorted sequence of elements: (a(5), t(2), v(1), i(1), k(1), s(1), r(1), o(1), f(1)).

Next, the last two elements of the sequence are replaced by a new element S1, which is assigned a frequency equal to the sum of the frequencies of the original elements. Then the elements of the sequence are sorted again according to their frequencies. In our case, the last two elements o(1) and f(1) are replaced by the element S1(2), and the newly sorted sequence takes the form: (a(5), t(2), S1(2), v(1), i(1), k(1), s(1), r(1)).

Continuing this procedure of replacing the last two elements of the sequence with a new element having their combined frequency, and then re-sorting the sequence according to the frequencies of its elements, we eventually arrive at a situation where only one element remains in the sequence (Fig. 4).

Fig. 4. Demonstration of the Huffman algorithm on the example of the word "air crash"

Simultaneously with the replacement of elements and the re-sorting of the sequence, a binary code tree is built. The algorithm for constructing the tree is very simple: the operation of combining (replacing) two elements of the sequence generates a new node element in the code tree. That is, if you look at the tree from the bottom up, the edges of the coding tree always come from the replaced elements and converge at a new node element corresponding to the element of the sequence obtained by the replacement (Fig. 5). In this case, the left edge of the code tree can be assigned the value "0" and the right edge the value "1". These values will later serve as the digits of the prefix codes.

Fig. 5. Building a code tree in the Huffman algorithm (replacing the elements "o" and "f" with the new element S1)

The complete Huffman code tree for the word "air crash" is shown in Fig. 6.

Fig. 6. Full code tree for the word "air crash", built according to the Huffman algorithm

Walking along the edges of the code tree from top to bottom, it is easy to read off the prefix codes for all the characters of our information alphabet: "a" - 0, "t" - 111, "v" - 1101, "i" - 11000, "k" - 11001, "s" - 1010, "r" - 1011, "o" - 1000, "f" - 1001.

Now if we write the word "air crash" in this Huffman encoding, we get the 41-bit sequence 0 1101 11000 0 11001 0 111 0 1010 111 1011 1000 1001 0. It is interesting to note that using the Shannon-Fano prefix codes we also get a 41-bit sequence for this word. That is, on this specific example the coding efficiency of Huffman and Shannon-Fano is the same. But if we take into account that the real information alphabet is 256 characters (and not 14, as in our example), and real information sequences are files of arbitrary content and length, then the question arises of the optimal prefix code, that is, the code that gives the minimum length of the output sequence.
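The 41-bit total can also be checked without drawing the tree: in the Huffman procedure every merge adds one bit to the codes of all symbols under the merged node, so the length of the encoded message equals the sum of the weights of all intermediate nodes. A short sketch:

#include <stdio.h>

int main(void)
{
    int w[] = { 5, 2, 1, 1, 1, 1, 1, 1, 1 };   /* symbol frequencies of the word */
    int n = sizeof w / sizeof w[0];
    long total_bits = 0;

    while (n > 1) {
        /* find the indices of the two smallest weights */
        int a = 0, b = 1;
        if (w[b] < w[a]) { int t = a; a = b; b = t; }
        for (int i = 2; i < n; i++) {
            if (w[i] < w[a])      { b = a; a = i; }
            else if (w[i] < w[b]) { b = i; }
        }
        int merged = w[a] + w[b];
        total_bits += merged;                  /* every merge deepens its subtree by one bit */
        w[a] = merged;                         /* replace the pair by the merged node */
        w[b] = w[n - 1];
        n--;
    }
    printf("encoded length: %ld bits\n", total_bits);  /* expected: 41 */
    return 0;
}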

It can be proved that the code system obtained using the Huffman algorithm is the best among all possible systems of prefix codes in the sense that the length of the resulting encoded information sequence is minimal. That is, the Huffman algorithm is optimal.

The main disadvantage of the Huffman algorithm is the complexity of the process of constructing a system of codes. Nevertheless, it is the optimal Huffman algorithm that is the most common variable-length code generation algorithm and is embodied in most utilities for compressing and archiving information.

Arithmetic coding

As we have already noted, according to the Shannon criterion, the optimal code is one in which each character is assigned a code of length −log₂ p bits. If, for example, the probability of a certain character is 0.2, then the optimal length of its code is −log₂ 0.2 = 2.3 bits. It is clear that prefix codes cannot always provide such a code length and are therefore not optimal. In general, the code length determined by the Shannon criterion is only a theoretical limit. The only question is which encoding method allows us to get as close as possible to this limit. Variable-length prefix codes are efficient and fairly easy to implement, but there are more effective coding methods, in particular the arithmetic coding algorithm.

The idea of ​​arithmetic coding is that instead of encoding individual characters, the code is assigned simultaneously to the entire information sequence. At the same time, it is obvious that the length of the code per one individual character may not be an integer. That is why this encoding method is much more efficient than prefix encoding and is more consistent with the Shannon criterion.

The idea of the arithmetic coding algorithm is as follows. As in all previously considered encoding methods, each symbol of the original information sequence is characterized by its probability. The original uncoded sequence is assigned the half-interval [0, 1).
