Many algorithms exist today for compressing information. Most of them are widely known, but there are also some very effective yet little-known ones. This article discusses arithmetic coding, which is the best of the entropy methods but remains relatively obscure.
Before talking about arithmetic coding, a few words must be said about the Huffman algorithm. That method is efficient only when the symbol frequencies are equal to 1/2^n (where n is a positive integer). The reason becomes obvious when you recall that Huffman codes always consist of an integer number of bits per symbol. Consider a symbol whose frequency of occurrence is 0.2: the optimal code for it should have a length of -log2(0.2) ≈ 2.32 bits. A Huffman prefix code cannot have this length, and this ultimately leads to poorer data compression.
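As a quick illustration (a sketch of my own, not from the original text), the ideal code length for a given symbol probability can be computed directly:

#include <math.h>
#include <stdio.h>

/* Ideal (entropy) code length in bits for a symbol of probability p. */
double ideal_code_length(double p) {
    return -log2(p);
}

int main(void) {
    /* For p = 0.2 the ideal length is about 2.32 bits; a Huffman code
       has to round this to a whole number of bits (2 or 3). */
    printf("%.2f bits\n", ideal_code_length(0.2));
    return 0;
}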
Arithmetic coding is designed to solve this problem. The main idea is to assign codes not to individual characters, but to their sequences.
First we will consider the idea behind the algorithm, then a small practical example.
As in all entropy algorithms, we know the frequency of occurrence of every character of the alphabet; this information is the input to the method. Let us now introduce the notion of the working interval: the half-open interval that is successively narrowed as coding proceeds, starting from [0; 1).
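To make the working interval concrete, here is a small sketch (my own illustration, with an assumed two-symbol alphabet {a, b} of frequencies 0.6 and 0.4) of how the interval narrows as symbols arrive:

#include <stdio.h>

/* Cumulative bounds for the assumed alphabet:
   a -> [0.0; 0.6), b -> [0.6; 1.0). */
static const double lo_bound[2] = { 0.0, 0.6 };
static const double hi_bound[2] = { 0.6, 1.0 };

int main(void) {
    double low = 0.0, high = 1.0;    /* the working interval [0; 1) */
    int message[] = { 0, 1, 0 };     /* the string "aba" */
    for (int i = 0; i < 3; i++) {
        double width = high - low;
        high = low + width * hi_bound[message[i]];
        low  = low + width * lo_bound[message[i]];
        printf("[%f; %f)\n", low, high);
    }
    /* Any number inside the final interval encodes the whole string. */
    return 0;
}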

Huffman's method is simple, but it is effective only when the symbol probabilities are equal to 2^-n, where n is a positive integer. This is because the Huffman code assigns an integer number of bits to each character of the alphabet. At the same time, information theory tells us that a symbol with, for example, probability 0.4 should ideally be assigned a code of length -log2(0.4) ≈ 1.32 bits. When constructing Huffman codes the length cannot be set to 1.32 bits, only to 1 or 2 bits, which results in poorer compression. Arithmetic coding solves this problem by assigning a code to the whole transmitted file, which is usually large, instead of encoding individual characters.

The idea of arithmetic coding is best illustrated with a simple example. Suppose we need to encode the input string SWISS_MISS with the given character frequencies: S = 0.5, W = 0.1, I = 0.2, M = 0.1 and _ = 0.1. In an arithmetic coder each character c is represented by a half-open subinterval [b[c]; B[c]) of the working range [0; 1) whose width equals the character's frequency. Decoding reverses this construction: at each step the decoder enumerates the possible symbols and finds the one whose subinterval contains the current code value. If, say, the value read is 0.341, enumeration over the table shows that only the interval of the character S contains it. In code:

for (c = 0; c < 256; c++) {
    li = l_prev + b[c] * (h_prev - l_prev);
    hi = l_prev + B[c] * (h_prev - l_prev);
    if ((li <= value) && (value < hi)) break;
}
DataFile.WriteSymbol(c);

where value is the number (fraction) read from the stream, and c is the unpacked character written to the output stream. With an alphabet of 256 characters the inner loop takes a long time to complete, but it can be sped up: the cumulative bounds B[c] grow monotonically, so the required symbol can be found by binary search instead of linear enumeration.

The key fragment of the compressor in integer arithmetic looks like this (16-bit interval bounds; b[] holds the cumulative frequency counts and delitel is their total):

l[0] = 0; h[0] = 65535; i = 0;  // initial working interval
delitel = 10;                   // total of the frequency counts for SWISS_MISS

First_qtr = (h[0] + 1) / 4;     // = 16384
Half      = First_qtr * 2;      // = 32768
Third_qtr = First_qtr * 3;      // = 49152
bits_to_follow = 0;             // how many bits to output

while (!DataFile.EOF()) {
    c = DataFile.ReadSymbol();  // read the symbol
    j = IndexForSymbol(c); i++; // find its index
    l[i] = l[i-1] + b[j-1] * (h[i-1] - l[i-1] + 1) / delitel;
    h[i] = l[i-1] + b[j]   * (h[i-1] - l[i-1] + 1) / delitel - 1;
    // ... interval renormalization and bit output follow,
    // mirroring the loop in the decompressor below
}
The decompressor's main loop performs the same interval arithmetic:

First_qtr = (h[0] + 1) / 4;     // = 16384
Half      = First_qtr * 2;      // = 32768
Third_qtr = First_qtr * 3;      // = 49152

value = CompressedFile.Read16Bit();
for (i = 1; i < CompressedFile.DataLength(); i++) {
    // which slot of the frequency table does value fall into?
    freq = ((value - l[i-1] + 1) * delitel - 1) / (h[i-1] - l[i-1] + 1);
    for (j = 1; b[j] <= freq; j++)   // symbol search
        ;
    l[i] = l[i-1] + b[j-1] * (h[i-1] - l[i-1] + 1) / delitel;
    h[i] = l[i-1] + b[j]   * (h[i-1] - l[i-1] + 1) / delitel - 1;
    for (;;) {                       // process the variants
        if (h[i] < Half)             // lower half: nothing to adjust
            ;
        else if (l[i] >= Half) {     // upper half
            l[i] -= Half; h[i] -= Half; value -= Half;
        }
        else if ((l[i] >= First_qtr) && (h[i] < Third_qtr)) {  // middle
            l[i] -= First_qtr; h[i] -= First_qtr; value -= First_qtr;
        }
        else break;
        l[i] += l[i]; h[i] += h[i] + 1;            // double the interval
        value += value + CompressedFile.ReadBit();
    }
    c = SymbolForIndex(j);           // inverse of IndexForSymbol
    DataFile.WriteSymbol(c);
}

Exercise. Give examples of sequences that this algorithm compresses with the maximum and the minimum ratio.

As you can see, we combat the inaccuracies of integer arithmetic by performing exactly the same operations on l_i and h_i, synchronously in the compressor and the decompressor.

A slight loss of accuracy (fractions of a percent for a sufficiently large file), and hence a decrease in the compression ratio compared with the ideal algorithm, occurs in the division operation, when the relative frequencies are rounded to integers, and when the last bits are written to the file. The algorithm can be accelerated by representing the relative frequencies so that the divisor is a power of two (i.e., by replacing the division with a bitwise shift).
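A sketch of that speedup (with the assumption, mine, that the cumulative counts b[] have been rescaled so that their total is 2^14):

#define FREQ_BITS 14    /* assumed: total frequency = 1 << FREQ_BITS */

/* Interval update with the divisions replaced by right shifts. */
void update_interval(unsigned *l, unsigned *h, const unsigned b[], int j) {
    unsigned long long width = (unsigned long long)(*h - *l + 1);
    unsigned l_old = *l;
    *l = l_old + (unsigned)((b[j - 1] * width) >> FREQ_BITS);
    *h = l_old + (unsigned)((b[j]     * width) >> FREQ_BITS) - 1;
}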

To estimate the compression ratio of the arithmetic algorithm on a particular string, we need to find the minimum number N such that the length of the working interval after compressing the last character of the string is less than 1/2^N. This criterion means that our interval is then guaranteed to contain at least one number whose binary representation has only zeros after the N-th digit. The length of the interval is easy to compute, since it equals the product of the probabilities of all the symbols.

Consider the earlier example: a string of 256 characters over the two-character alphabet {a, b}, with probabilities 253/256 for a and 3/256 for b. The length of the last working interval for such a string is (253/256)^253 · (3/256)^3 ≈ 8.2·10^-8. It is easy to calculate that the required N = 24 (1/2^24 ≈ 5.96·10^-8), since N = 23 gives an interval that is too large (twice as wide), while N = 25 is not the minimal number meeting the criterion. It was shown above that the Huffman algorithm encodes this string in 256 bits. So in this example the arithmetic algorithm gives a tenfold advantage over the Huffman algorithm, requiring less than 0.1 bit per character.

Exercise. Estimate the compression ratio for the string "COV.KOROBA".

A few words should be said about the adaptive arithmetic compression algorithm. Its idea is to rebuild the probability table b[i] during packing and unpacking, immediately upon receipt of each next character. Such an algorithm does not require storing the symbol probabilities in the output file and, as a rule, gives a high compression ratio. For example, a file of the form a^1000 b^1000 c^1000 d^1000 (where the exponent denotes the number of repetitions of a character) will be compressed by the adaptive algorithm more efficiently than by spending 2 bits per character. The algorithm above turns into an adaptive one quite simply: previously we stored the range table in the file, and now we compute it on the fly, in both the compressor and the decompressor, recalculating the relative frequencies and adjusting the range table in accordance with them. It is important that the table changes in the compressor and the decompressor synchronously: for example, after coding a chain of length 100, the range table must be exactly the same as after decoding a chain of length 100. This condition is easy to satisfy if the table is changed after encoding or decoding each character. Adaptive algorithms are discussed in more detail in Chapter 4.
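A minimal sketch of such an adaptive model (my own illustration; the rescaling threshold and the names are assumptions). Both sides start from identical uniform counts and update the table after each coded character, which keeps compressor and decompressor synchronized:

#define ALPHABET  256
#define MAX_TOTAL 16383              /* assumed rescaling threshold */

static unsigned freq[ALPHABET];      /* per-symbol counts */
static unsigned total;

void model_init(void) {
    for (int i = 0; i < ALPHABET; i++)
        freq[i] = 1;                 /* uniform starting table */
    total = ALPHABET;
}

/* Called after encoding AND after decoding a symbol c, so that both
   sides rebuild exactly the same range table. */
void model_update(int c) {
    freq[c]++; total++;
    if (total > MAX_TOTAL) {         /* keep the counts bounded */
        total = 0;
        for (int i = 0; i < ALPHABET; i++) {
            freq[i] = (freq[i] + 1) / 2;   /* halve, stay nonzero */
            total += freq[i];
        }
    }
}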

Arithmetic algorithm characteristics:

Best and worst compression ratios: best > 8 (coding with less than one bit per symbol is possible), worst: 1.

Pros of the algorithm: it provides a better compression ratio than the Huffman algorithm (by 1-10% on typical data).

Peculiarities: like Huffman coding, it does not increase the size of the source data in the worst case.

Interval coding

Unlike the classical algorithm, interval coding assumes that we are dealing with discrete integer values that can take a limited number of values. The initial interval in integer arithmetic is accordingly written as [0; N), where N is the number of possible values of the variable used to store the bounds of the interval.

To compress data most efficiently, we should encode each symbol s with -log2(f_s) bits, where f_s is the frequency of the symbol s. In practice such accuracy is unattainable, but we can assign each symbol s a sub-range of the interval [0; N) whose size is proportional to f_s, and narrow the working interval to that sub-range as each symbol arrives.
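The per-symbol step of such a coder is commonly written as in the sketch below (a common formulation, not quoted from this text; cumFreq, freq and totFreq are the symbol's cumulative frequency, its own frequency and the total of all frequencies, and encode_normalize is the function shown later in this section):

static unsigned int low, range;    /* coder state */

void encode_normalize(void);       /* defined below */

/* Narrow the current interval [low; low + range) to the sub-range
   belonging to one symbol. */
void encode_symbol(unsigned cumFreq, unsigned freq, unsigned totFreq) {
    range /= totFreq;              /* size of one frequency unit */
    low   += cumFreq * range;      /* skip the preceding symbols */
    range *= freq;                 /* keep only this symbol's share */
    encode_normalize();            /* push out the finished bytes */
}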

[Worked example: a step-by-step table of intermediate results, with normalization performed at three of the steps.]

As already noted, most often no carry occurs during normalization. Proceeding from this, Dmitry Subbotin proposed abandoning the carry altogether. The loss in compression turned out to be quite insignificant, on the order of a few bytes; the speed gain was not very noticeable either. The main advantage of this approach is the simplicity and compactness of the code. This is what the normalization function looks like for 32-bit arithmetic:

#define CODEBITS 24
#define TOP     (1 << CODEBITS)
#define BOTTOM  (TOP >> 8)
#define BIGBYTE (0xFF << (CODEBITS - 8))

void encode_normalize(void) {
    while (range < BOTTOM) {
        if ((low & BIGBYTE) == BIGBYTE &&
            range + (low & (BOTTOM - 1)) >= BOTTOM)
            range = BOTTOM - (low & (BOTTOM - 1));
        output_byte(low >> 24);
        range <<= 8; low <<= 8;
    }
}

Note that the timely forced reduction of the interval size allows us to avoid the carry altogether: a carry could only happen when the second most significant byte of low equals 0xFF and adding the interval size to low overflows into it. This is the optimized normalization procedure:

void encode_normalize(void) {
    while ((low ^ (low + range)) < TOP ||
           (range < BOTTOM && ((range = -low & (BOTTOM - 1)), 1))) {
        output_byte(low >> 24);
        range <<= 8; low <<= 8;
    }
}

void decode_normalize(void) {
    while ((low ^ (low + range)) < TOP ||
           (range < BOTTOM && ((range = -low & (BOTTOM - 1)), 1))) {
        code = (code << 8) | input_byte();
        range <<= 8; low <<= 8;
    }
}
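These functions operate on shared coder state that the excerpt does not show; a plausible set of declarations (an assumption on my part, following common range-coder implementations) is:

static unsigned int low   = 0;           /* lower bound of the interval  */
static unsigned int range = 0xFFFFFFFF;  /* current size of the interval */
static unsigned int code  = 0;           /* decoder: buffered input bits */

Before its main loop the decoder would typically preload code with the first bytes of the compressed stream.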

Exercise. Apply interval coding without carries to the string "cow. cow".

2.3 Arithmetic coding

The Shannon-Fano and Huffman algorithms can at best encode each message symbol with no less than 1 bit of information. Suppose a message consists of 0s and 1s in which ones occur nine times more often than zeros (p1 = 0.9, p0 = 0.1). The entropy of such a message is H = -(0.9·log2 0.9 + 0.1·log2 0.1) ≈ 0.469 bits per symbol. In such a case it is desirable to have a coding scheme that allows message symbols to be encoded with less than 1 bit of information. One of the best algorithms for coding such information is arithmetic coding.

Based on the initial probability distribution of the discrete random variable, a table is constructed consisting of segments that intersect only at their boundary points, one segment for each value of the variable. The union of these segments forms the interval [0; 1), and their lengths are proportional to the probabilities of the encoded values.

The coding algorithm consists in constructing a segment that uniquely identifies a particular sequence of message symbols. As input characters arrive, the message segment narrows. The segments are constructed as follows. If there is a message segment of length n - 1, then to construct the segment of a message of length n, the previous interval is divided into as many parts as there are values in the source alphabet. The beginning and the end of each new interval are determined by adding to the beginning of the previous interval the product of its width by the boundary values of the segment corresponding to the current character (taken from the original table of symbol probabilities and their assigned intervals). Then, from the resulting segments, the one corresponding to the specific message sequence of length n is selected.
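Written out explicitly (the notation is mine), the boundaries of the message segment after the n-th character c_n are:

low_n  = low_(n-1) + (high_(n-1) - low_(n-1)) * L(c_n)
high_n = low_(n-1) + (high_(n-1) - low_(n-1)) * H(c_n)

where [L(c); H(c)) is the segment assigned to the character c by the probability table, and the initial segment is [low_0; high_0) = [0; 1).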

For the constructed message segment, a number belonging to it is found; usually this is the number with the smallest possible power of 2 as its denominator, i.e., the shortest binary fraction inside the segment. This real number is the code of the sequence in question. All possible codes are numbers strictly greater than 0 and less than 1, so the leading 0 and the binary point are not transmitted.

As the source text is encoded, its interval narrows and, accordingly, the number of bits needed to represent it grows. Successive characters of the input reduce the interval width in proportion to their probabilities: more probable symbols narrow the interval less than improbable ones and therefore add fewer digits to the result.

The fundamental difference between arithmetic coding and the Shannon-Fano and Huffman compression methods is its continuity: there is no need to split the message into blocks. The efficiency of arithmetic coding grows with the length of the message being compressed, although it demands more computational resources.

Let us explain the idea of arithmetic coding with specific examples.

Example 1. Let us encode the text string "MATEMATIKA" (the Russian word for "mathematics") using the arithmetic coding algorithm.

The alphabet of the encoded message contains the following characters: {M, A, T, E, I, K}.

Let us determine the frequency of each symbol in the message and assign to each a segment whose length is proportional to the probability of the corresponding symbol (Table 2.7).

The symbols in the table of symbols and intervals can be listed in any order: as they occur in the text, alphabetically, or in increasing order of probability; this does not matter. The encoding result may differ, but the effect will be the same.

Table 2.7

Symbol   Probability   Interval
M        0.2           [0; 0.2)
A        0.3           [0.2; 0.5)
T        0.2           [0.5; 0.7)
E        0.1           [0.7; 0.8)
I        0.1           [0.8; 0.9)
K        0.1           [0.9; 1)

The construction works the same way for whole sequences. For a binary source with p(0) = 2/3 and p(1) = 1/3, each message of up to three symbols receives its own interval, and each three-symbol message is assigned, as its code number, the shortest binary fraction lying inside its interval:

Message   Interval        Code number
1         (2/3; 1]
11        (8/9; 1]
111       (26/27; 1]      31/32
110       (8/9; 26/27]    15/16
10        (2/3; 8/9]
101       (22/27; 8/9]    7/8
100       (2/3; 22/27]    3/4
0         (0; 2/3]
01        (4/9; 2/3]
011       (16/27; 2/3]    5/8
010       (4/9; 16/27]    1/2
00        (0; 4/9]
001       (8/27; 4/9]     3/8
000       (0; 8/27]       1/4

The average code length per message symbol is then the sum of p(m) · len(code(m)) over the three-symbol messages m, divided by 3.

Here is the arithmetic coding procedure for a sequence of arbitrary length:

While there are still input symbols do
    get an input symbol
    code_range = high - low
    high = low + code_range * high_range(symbol)
    low  = low + code_range * low_range(symbol)
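For concreteness, here is a sketch of this procedure in C, using the segment bounds of Table 2.7 (the interval assignment there is, as noted, only one possible choice):

#include <stdio.h>

/* Segments from Table 2.7, alphabet order M, A, T, E, I, K. */
static const double seg_lo[] = { 0.0, 0.2, 0.5, 0.7, 0.8, 0.9 };
static const double seg_hi[] = { 0.2, 0.5, 0.7, 0.8, 0.9, 1.0 };

int main(void) {
    int msg[] = { 0, 1, 2, 3, 0, 1, 2, 4, 5, 1 };  /* "MATEMATIKA" */
    double low = 0.0, high = 1.0;
    for (int i = 0; i < 10; i++) {
        double code_range = high - low;
        high = low + code_range * seg_hi[msg[i]];
        low  = low + code_range * seg_lo[msg[i]];
    }
    /* Any number in [low; high) can serve as the code of the string. */
    printf("code interval: [%.12f; %.12f)\n", low, high);
    return 0;
}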

The decoder, like the encoder, knows the table of segments assigned to the symbols of the source alphabet. Decoding of an arithmetic message code is performed by the following algorithm:

Step 1. From the table of segments of the alphabet symbols, the interval containing the current message code is found; this interval uniquely determines one symbol of the original message. If this symbol is the end-of-message marker, decoding is finished; otherwise go to step 2.

Step 2. The lower bound of the interval containing the current code is subtracted from the code. The resulting difference is divided by the length of this interval. The result is taken as the new current code. Go to step 1.
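The same two steps in C (a sketch in the style of the encoder above, reusing its seg_lo/seg_hi tables and assuming the message length is known instead of an end-of-message marker):

/* Decode n symbol indices from the code value v, which must lie in
   the final interval produced by the encoder. */
void decode(double v, int n, int *out) {
    for (int i = 0; i < n; i++) {
        int c = 0;
        while (!(seg_lo[c] <= v && v < seg_hi[c]))    /* step 1: find the segment */
            c++;
        out[i] = c;
        v = (v - seg_lo[c]) / (seg_hi[c] - seg_lo[c]);  /* step 2: rescale */
    }
}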

Consider an example of decoding a message compressed using the arithmetic coding algorithm.

Example 3. The length of the original message is 10 characters. The binary arithmetic code of the message is 000101000001100001011111₂ = 1316959₁₀.

The real number belonging to the interval [0; 1) that uniquely identifies the encoded message is 0.000101000001100001011111₂ = 1316959/2^24 ≈ 0.0785. This number will be the current code of the message.

From the original table of values of the discrete random variable and the intervals assigned to them (Table 2.7), the segment to which this number belongs is determined, and with it the first symbol of the original message; the current code is then recalculated according to step 2, and the procedure repeats.
