Lesson summary on "data compression and archiving". Data Compression Presentation

The proposed materials formed the basis of the test work for the course "Mathematical Foundations of Informatics" (by A. Gein), which I successfully completed in 2011 at the distance university "September 1". The materials are adapted for an extended course in computer science in grade 11.


Data compression

Lyceum No. 329, Andreeva O.A.

Lesson plan

Grade 11

Lesson topic: Data compression.

Lesson type: study of new educational material with elements of whole-class discussion.

Lesson goal: expanding competencies in creating one's own information space.

Lesson Objectives:

curricular - to consider the concept of data compression and become familiar with algorithms for compressing character data;

cognitive - to introduce the concept of data redundancy;

educational - to create conditions for the active engagement of every student.

Software for the lesson: presentation on the topic "Data compression".

Technical support for the lesson:

  1. a student workstation with a Pentium III PC;
  2. felt-tip (marker) board;
  3. projector for demonstrating the presentation.

DURING THE CLASSES

Stage I: introduction of the lesson topic and motivation for studying the material;

Stage II: presentation of the new material;

Stage III: consolidation of the acquired knowledge - answers to review questions;

Stage IV: homework assignment; summing up the lesson.

Lesson summary

Stage I:


The amount of information a person needs is steadily growing. Storage capabilities and bandwidth are also growing. However, the amount of information is growing faster.
There are three ways to solve this problem:

The first is to limit the amount of information. Unfortunately, this is not always acceptable. For images, for example, it means lowering the resolution, which leads to the loss of fine details and can make the images useless altogether (for example, medical or space images). This approach is not applicable to programs and texts.
The second is to increase the capacity of storage media and the bandwidth of communication channels. This involves material costs, sometimes very significant ones.
The third is to use data compression. This makes it possible to reduce the requirements for storage capacity and channel bandwidth several times over, without additional costs (other than the cost of implementing the compression algorithms). The conditions for its applicability are redundancy of the information and the ability to install special software or hardware to perform these procedures.

A characteristic feature of most types of digital data is their redundancy. The amount of data redundancy depends on the data type. For example, for video data, the degree of redundancy is several times greater than for graphic data, and the degree of redundancy of graphic data, in turn, is greater than the degree of redundancy of text data. Another factor affecting the degree of redundancy is the coding system adopted. The efficiency of coding can be increased when working with a specific set of data.

It is theoretically proved that the redundancy of the literary Russian text is 0.6. In other words, when transmitting text over a communication channel, every six letters out of ten transmitted do not carry any information and can simply not be transmitted without any loss.

Other sources of information - speech, and especially music, television images, etc. - have the same or even higher redundancy (ρ = 0.9...0.95).

Data compression is used in many information systems. Modern radio-technical systems of information transmission and communication simply could not work if this kind of compression was not performed in them. There would be no digital cellular communication of the GSM and CDMA standards. The digital satellite television systems would not work, the Internet would be very ineffective, and there could be no question of watching a video or listening to good music from a laser disc. All this is ensured by efficient or economical coding of information in these systems.

Stage II:

Demonstration of the presentation at the right pace with explanations of the material on each slide.

Slide 5 (additional explanations):

The choice of a lossless ("non-destructive") or lossy ("destructive") compression system depends on the type of data to be compressed. When compressing text data, computer programs, documents, drawings, etc., it is obvious that lossless methods must be used, since the original information has to be restored absolutely exactly after compression. When compressing speech, music and images, on the contrary, lossy compression is used more often, since with almost imperceptible distortions it yields a substantially smaller data volume, sometimes by an order of magnitude or more. In general, lossy compression provides, as a rule, significantly higher compression ratios than lossless compression.

Slide 18 (additional explanations):

Let's look at the principle of dictionary-based compression. LZ is a data compression technique that takes advantage of repeated chains of data. Initially the dictionary contains the individual symbols in use; then character chains that differ from already stored chains by one character are added to it. The chains grow longer, and chains of whole words occur in the text more and more often; these are replaced by references to the dictionary. Words, phrases and lines of text end up in the dictionary. The compression effect is achieved by encoding the repeated character chains.

Let's picture this process using the example of compressing a phrase. The dictionary has already been built, and the strings of interest to us (those that will repeat in the phrase) have been added to it with numeric references. When these strings occur in the phrase again, they are replaced by the references. Obviously, the data is compressed.

There is a whole family of LZ algorithms that are efficient for different data types.
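For illustration, here is a minimal LZ78-style sketch in Python. The slides describe the LZ family only in general terms, so this particular variant and its output format are assumptions chosen for demonstration.

    # Minimal LZ78-style sketch: the dictionary starts empty and grows with
    # chains that are one character longer than chains already seen.
    def lz78_compress(text):
        dictionary = {}          # chain -> index
        result = []              # (index of longest known prefix, next character)
        current = ""
        for ch in text:
            candidate = current + ch
            if candidate in dictionary:
                current = candidate                      # keep extending the chain
            else:
                result.append((dictionary.get(current, 0), ch))
                dictionary[candidate] = len(dictionary) + 1
                current = ""
        if current:
            result.append((dictionary[current], ""))
        return result

    print(lz78_compress("abababab"))
    # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, '')] - repeated chains become references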

Conclusion of the II stage:

Thus, the future belongs to new algorithms with high resource requirements and higher and higher compression ratios.

Not only are algorithms becoming obsolete, but also the types of information they are oriented to. Thus, graphics with a small number of colors and unformatted text were replaced by high-quality images and electronic documents in various formats. Known algorithms are not always efficient on new data types. This makes the problem of synthesizing new algorithms extremely urgent.



Slide captions:

Data compression

The purpose of data compression is to save resources when storing or transferring data. Data compression is the process of reducing the amount of data.

Compression methods:
  • modifying the data content (reducing data redundancy);
  • modifying the data structure (efficient encoding);
  • modifying both the content and the structure of the data.

[Slide diagram: original data -> compressed data (new format) -> recovered data (original format).] Data archiving is compression with full recovery of the data.

Compression ratio is a measure of the efficiency of a compression method, equal to the ratio of the amount of information before and after compression. Example: original file size 2 MB, compressed file size 512 KB; K = 2 MB / 0.5 MB = 4.

Data compression can be performed with or without loss. Lossless compression (fully reversible) refers to compression methods in which the data is restored after unpacking completely, without any changes (used for texts and programs); compression of up to 50%. Lossy compression refers to compression methods in which part of the data is discarded and cannot be recovered (used for video, sound and images); compression of up to 99%.

Lossy compression: graphics (.JPG), video (.MPG), audio (.MP3) - compression of up to 99%.

Lossless compression: graphics (.GIF, .TIF, .PCX), video (.AVI), any data type (.ZIP, .ARJ, .RAR, .LZH) - compression of up to 50%.

Character Data Compression Algorithms Statistical methods are compression methods based on statistical processing of text. Dictionary compression is a compression technique based on the construction of an internal dictionary.

Packing homogeneous data. New code table for packing (4 bits per character):

0 - 0000; 1 - 0001; 2 - 0010; 3 - 0011; 4 - 0100; 5 - 0101; 6 - 0110; 7 - 0111; 8 - 1000; 9 - 1001; _ - 1010; + - 1011; "-" - 1100; "," - 1101

Encode the 16-character message 0,258-23,5+18,01. In ASCII encoding the message takes 16 bytes. After packing, the message code takes 8 bytes (16 × 4 = 64 bits):

0000 1101 0010 0101 1000 1100 0010 0011 1101 0101 1011 0001 1000 1101 0000 0001

K = 16 / 8 = 2
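A small Python sketch of this packing example, using the same 4-bit code table:

    # Packing a homogeneous message over a small (14-symbol) alphabet:
    # 4 bits per character instead of 8 bits in ASCII.
    CODE = {'0': '0000', '1': '0001', '2': '0010', '3': '0011', '4': '0100',
            '5': '0101', '6': '0110', '7': '0111', '8': '1000', '9': '1001',
            '_': '1010', '+': '1011', '-': '1100', ',': '1101'}

    def pack(message):
        return ''.join(CODE[ch] for ch in message)

    msg = "0,258-23,5+18,01"                              # 16 characters
    bits = pack(msg)
    print(len(bits), "bits =", len(bits) // 8, "bytes")   # 64 bits = 8 bytes
    print("K =", (len(msg) * 8) / len(bits))              # 16 bytes / 8 bytes = 2.0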

Advantages and disadvantages of the method:
+ the compression ratio increases with the length of the message;
- a new code table has to be supplied for unpacking;
- effective only for homogeneous messages that use a limited set of characters of the original alphabet.

Statistical compression method: the Huffman algorithm. Different characters occur in a message with different frequencies; for example, for the Russian alphabet the slide gives average repetition counts per 1000 characters, with the space character the most frequent. The idea: the more often a character occurs, the shorter its code (non-uniform, variable-length encoding).

Huffman coding (compression) is a compression technique that assigns variable-length codes to the characters of the alphabet based on the frequency of occurrence of those characters in the message. Example character codes: space - 00, o - 01, p - 101, t - 110, s - 0110, f - 1001.

The decoding problem. Example: let the character codes be a - 10, b - 101, c - 1010. Decode the message 10101011010. It can be split into codewords in more than one way, e.g. 10|10|101|1010 (aabc) or 1010|101|1010 (cbc), so the decoding is ambiguous.

A prefix code is a code in which no codeword is a prefix of any other codeword. Example of a prefix code: 00, 10, 010, 110, 0110, 0111, 1110, 1111. A prefix code is defined by a binary digraph (tree) with marked leaves: branches are labelled 0 and 1, and each leaf corresponds to a codeword.
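A tiny Python check of the prefix property, using the codes from the two examples above:

    # A code is a prefix code if no codeword is a prefix of another.
    def is_prefix_code(codewords):
        for a in codewords:
            for b in codewords:
                if a != b and b.startswith(a):
                    return False
        return True

    print(is_prefix_code(['00', '10', '010', '110', '0110', '0111', '1110', '1111']))  # True
    print(is_prefix_code(['10', '101', '1010']))  # False: '10' is a prefix of '101'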

Example: build the Huffman code for the phrase ОТ_ТОПОТА_КОПЫТ_ПЫЛЬ_ПО_ПОЛЮ_ЛЕТИТ (a Russian tongue-twister: "the clatter of hooves sends dust flying across the field"). Determine the frequency of occurrence of each character in the phrase:

symbol:    А Е И К Л О П Т Ы Ь Ю _
frequency: 1 1 1 1 3 6 5 6 2 1 1 6

Build the Huffman digraph:
  • each symbol defines a leaf vertex of the digraph;
  • the weight of a vertex is equal to the frequency of occurrence of its symbol;
  • at each step the pair of vertices with the smallest weights is joined;
  • left branches are labelled 0, right branches are labelled 1;
  • the code of a character is read from the root to its leaf.

[Huffman tree diagram: starting from the root, left branches are labelled 0 and right branches 1; each leaf is a character of the phrase, and its codeword (10, 010, 011, 001, 111, 0000, 00011, 00010, 11000, 11001, 11010, 11011) is read off along the path from the root to that leaf.]

The constructed prefix codes of the characters:

symbol:      А     Е     И     К     Л   О  П   Т   Ы    Ь     Ю     _
code:        11011 11000 11010 11001 001 10 011 010 0000 00011 00010 111
code length: 5     5     5     5     3   2  3   3   4    5     5     3

The message in the new codes takes 110 bits; in ASCII encoding it takes 34 × 8 = 272 bits, so K = 272 / 110 ≈ 2.47.

Advantages and disadvantages of the method:
+ Huffman's algorithm is universal - it can be used to compress data of any type;
- the classic Huffman algorithm requires storing the coding tree along with the data, which increases the file size.

Dictionary method: the LZ compression algorithm. This algorithm was first described by A. Lempel and J. Ziv (Abraham Lempel, Jacob Ziv) in 1977-78, which is why the method is often called Lempel-Ziv, or LZ for short. It is based on the idea of replacing the most frequently encountered character strings in a file with references to "sample" strings stored in a specially created table (dictionary).

The algorithm was developed by the Israeli mathematicians Jacob Ziv and Abraham Lempel. In the slide's example the sentence itself is compressed: the dictionary contains, among many others, the chains 1-pa, 2-ab, 3-at, 4-mate, 5-mi_, 6-am, 7-bo, 8-th_, 9-bom, 10-th, 11-lem (fragments of the sentence in the Russian original), and when the sentence is encoded these chains are replaced by their dictionary numbers, so it becomes much shorter. The longer the chain replaced by a dictionary reference, the greater the compression effect.

Advantages and disadvantages of the method:
+ applicable to any data;
+ very high compression speed;
+ high compression ratio;
- the dictionary is tuned to the type of text;
- the dictionary can become very large.

Lecture number 4. Compression of information

Information compression principles

The purpose of data compression is to provide a compact representation of the data generated by the source for more economical storage and transmission over communication channels.

Suppose we have a file of 1 (one) megabyte in size. We need to get a smaller file from it. Nothing complicated - we run an archiver, for example, WinZip, and we get, say, a file of 600 kilobytes in size. Where did the remaining 424 kilobytes go?

Compressing information is one of the ways to encode it. In general, codes are divided into three large groups - compression codes (effective codes), error-correcting codes and cryptographic codes. Codes designed to compress information are divided, in turn, into lossless codes and lossy codes. Lossless coding implies absolutely accurate data recovery after decoding and can be used to compress any information. Lossy coding usually has a much higher compression ratio than lossless coding, but allows some deviations of the decoded data from the original.

Compression types

All methods of information compression can be conventionally divided into two large non-overlapping classes: lossy compression and lossless compression.

Compression without loss of information.

We are primarily interested in these compression methods, since they are used when transferring text documents and programs, when issuing a completed work to a customer, or when creating backup copies of information stored on a computer.

Compression methods of this class cannot allow any loss of information, so they are based solely on eliminating its redundancy - and information almost always has redundancy (unless someone has already compressed it). If there were no redundancy, there would be nothing to compress.

Here's a simple example. The Russian language has 33 letters, ten digits, and about a dozen more punctuation marks and other special characters. For text written only in uppercase Russian letters (as in telegrams and radiograms), sixty different values would be enough. However, each character is usually encoded by a byte, which contains 8 bits and can express 256 different codes. This is the first reason for redundancy. For our "telegraphic" text, six bits per character would be enough.

Here's another example. In the international ASCII character encoding, the same number of bits (8) is allocated to encode every character, although it has long been well known that it makes sense to encode the most frequent characters with fewer bits. For example, in Morse code the frequently occurring letters "E" and "T" are encoded by a single symbol (a dot and a dash, respectively), while rare letters are encoded with four symbols. Inefficient encoding is the second reason for redundancy. Programs that compress information can introduce their own encoding (different for different files) and attach a table (dictionary) to the compressed file, from which the unpacking program learns how particular characters or groups of characters are encoded in this file. Algorithms based on recoding the information are called Huffman algorithms.

The presence of repeating fragments is the third reason for redundancy. This is rare in texts, but in tables and graphics repetition of codes is common. For example, if the number 0 is repeated twenty times in a row, there is no point in storing twenty zero bytes. Instead, one zero and a count of 20 are stored. Such algorithms, based on detecting repetitions, are called RLE (Run Length Encoding) methods.

Large repeating sequences of identical bytes are especially characteristic of graphic illustrations - not photographic ones (which contain a lot of noise, and neighboring points differ noticeably in their parameters), but those that artists paint with "smooth" color, as in animated films.

Lossy compression.

Lossy compression means that after unpacking the compressed archive, we get a document that is slightly different from the one that was at the very beginning. It is understood that the larger the compression ratio, the larger the loss, and vice versa.

Of course, such algorithms are not applicable for text documents, database tables, and especially for programs. Slight distortions in a simple unformatted text can still be somehow survived, but distortion of at least one bit in the program will make it absolutely inoperative.

At the same time, there are materials in which it is worth sacrificing a few percent of the information in order to get compression tenfold. These include photographic illustrations, videos, and musical compositions. The loss of information during compression and subsequent unpacking in such materials is perceived as the appearance of some additional "noise". But since some "noise" is still present when creating these materials, its small increase does not always look critical, and the gain in file sizes is huge (10-15 times for music, 20-30 times for photo and video materials).

Lossy compression algorithms include such well-known algorithms as JPEG and MPEG. The JPEG algorithm is used to compress photographic images. Graphic files compressed using this method have the JPG extension. MPEG algorithms are used to compress video and music. These files can have different extensions, depending on the specific program, but the best known are .MPG for video and .MP3 for music.

Lossy compression algorithms are used only for consumer applications. This means, for example, that if a photo is sent for viewing, and music for playback, then similar algorithms can be used. If they are transferred for further processing, for example, for editing, then no loss of information in the original material is acceptable.

The amount of allowable compression loss can usually be controlled. This allows you to experiment and achieve the optimal size / quality ratio. In photographic illustrations intended for display on the screen, the loss of 5% of information is usually not critical, and in some cases, 20-25% can be tolerated.

Lossless compression algorithms

Shannon-Fano code

For further reasoning, it will be convenient to imagine our source file with text as a source of characters that appear one by one at its output. We do not know in advance what character will be next, but we know that the letter "a" will appear with probability p1, the letter "b" will appear with probability p2, and so on.

In the simplest case we will consider all the characters of the text to be independent of each other, i.e. the probability of the next symbol's appearance does not depend on the value of the previous symbol. Of course, this is not true for meaningful text, but for now we are considering a very simplified situation. In this case, the statement "a symbol carries the more information, the lower the probability of its appearance" is true.

Let's imagine a text whose alphabet consists of only 16 letters: А, Б, В, Г, Д, Е, Ж, З, И, К, Л, М, Н, О, П, Р. Each of these characters can be encoded with just 4 bits: from 0000 to 1111. Now imagine that the probabilities of occurrence of these characters are distributed as follows:

The sum of these probabilities is, of course, equal to one. Let's divide these symbols into two groups so that the total probability of the symbols in each group is ≈ 0.5 (Fig.). In our example these will be the character groups А-В and Г-Р. The circles in the figure denoting groups of symbols are called nodes, and the construction itself is called a binary tree (B-tree). Let's assign each node its own code, marking one node with the digit 0 and the other with the digit 1.

Let us again divide the first group (А-В) into two subgroups so that their total probabilities are as close to each other as possible. We add the digit 0 to the code of the first subgroup and the digit 1 to the code of the second.

We will repeat this operation until there is one symbol left at each vertex of our "tree". The complete tree for our alphabet will have 31 nodes.

The character codes (the rightmost nodes of the tree) are of unequal length. Thus, the letter А, which for our imaginary text has probability p = 0.2, is encoded with only two bits, while the letter Р (not shown in the figure), which has probability p = 0.013, is encoded with a six-bit combination.

So, the principle is obvious: frequently occurring characters are encoded with fewer bits, rare ones with more. As a result, the average number of bits per character is

n_avg = Σ p_i · n_i,

where n_i is the number of bits encoding the i-th symbol and p_i is the probability of the i-th symbol.
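A sketch of the Shannon-Fano construction in Python. The probabilities below are illustrative rather than the table referred to in the text, and the "split at roughly half the total probability" rule is the simple greedy variant:

    # Shannon-Fano sketch: split the (sorted) symbols into two groups of roughly
    # equal total probability, prepend 0/1, and recurse into each group.
    def shannon_fano(symbols):                 # symbols: list of (character, probability)
        if len(symbols) <= 1:
            return {symbols[0][0]: ''} if symbols else {}
        total, running, split = sum(p for _, p in symbols), 0.0, 1
        for i, (_, p) in enumerate(symbols[:-1], start=1):
            running += p
            if running >= total / 2:
                split = i
                break
        codes = {}
        for ch, code in shannon_fano(symbols[:split]).items():
            codes[ch] = '0' + code
        for ch, code in shannon_fano(symbols[split:]).items():
            codes[ch] = '1' + code
        return codes

    probs = [('A', 0.4), ('B', 0.2), ('C', 0.2), ('D', 0.1), ('E', 0.1)]   # sorted in descending order
    codes = shannon_fano(probs)
    print(codes)                               # {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}
    print("average bits per symbol:", round(sum(p * len(codes[ch]) for ch, p in probs), 2))   # 2.2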

Huffman code.

Huffman's algorithm gracefully implements the general idea of entropy coding using prefix sets and works as follows:

1. Write out in a row all the symbols of the alphabet in ascending or descending order of the likelihood of their appearance in the text.

2. Consecutively combine the two symbols with the lowest probabilities of appearance into a new composite symbol, whose probability is taken to be the sum of the probabilities of its constituent symbols. Eventually we will build a tree in which each node holds the total probability of all the nodes below it.

3. Trace the path to each leaf of the tree, marking the direction to each node (for example, to the right - 1, to the left - 0). The resulting sequence gives a code word corresponding to each character (Fig.).

Let's build a code tree for a message with the following alphabet:
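The alphabet table is not reproduced above, so as a sketch of the procedure the following Python snippet builds Huffman codes for an assumed example message, following steps 1-3:

    import heapq
    from collections import Counter

    # Huffman sketch: repeatedly merge the two lowest-weight nodes,
    # then read the codes off the tree (left branch = 0, right branch = 1).
    def huffman_codes(message):
        counts = Counter(message)
        heap = [(freq, i, ch, None, None) for i, (ch, freq) in enumerate(counts.items())]
        heapq.heapify(heap)
        next_id = len(heap)
        while len(heap) > 1:
            f1, _, ch1, l1, r1 = heapq.heappop(heap)      # the two least probable nodes
            f2, _, ch2, l2, r2 = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next_id, None,
                                  (f1, ch1, l1, r1), (f2, ch2, l2, r2)))
            next_id += 1
        codes = {}
        def walk(node, prefix):
            freq, ch, left, right = node
            if ch is not None:                            # a leaf: record its code
                codes[ch] = prefix or '0'
            else:
                walk(left, prefix + '0')
                walk(right, prefix + '1')
        root = heap[0]
        walk((root[0], root[2], root[3], root[4]), '')
        return codes

    msg = "ABRACADABRA"                                   # assumed example message
    codes = huffman_codes(msg)
    encoded = ''.join(codes[c] for c in msg)
    print(codes)
    print(len(encoded), "bits instead of", len(msg) * 8, "bits in 8-bit ASCII")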

Disadvantages of the methods

The biggest difficulty with codes, as the previous discussion suggests, is the need to have tables of probabilities for each type of data being compressed. This is not a problem if the English or Russian text is known to be compressed; we simply provide the encoder and decoder with a suitable code tree for English or Russian text. In the general case, when the probability of symbols for the input data is unknown, static Huffman codes work ineffectively.

The solution to this problem is statistical analysis of the encoded data, performed during the first pass over the data, and drawing up a code tree based on it. In this case, the actual encoding is performed in the second pass.

Another drawback of these codes is that the minimum codeword length cannot be less than one bit, while the entropy of a message may well be 0.1 or even 0.01 bits per letter. In this case the code becomes significantly redundant. The problem is solved by applying the algorithm to blocks of symbols, but then the encoding/decoding procedure becomes more complicated and the code tree, which ultimately must be stored along with the code, expands significantly.

These codes do not take into account the relationships between characters, which are present in almost any text. For example, if in the text in English we encounter the letter q, then we can confidently say that the letter u will come after it.

Run Length Encoding (RLE) is one of the oldest and simplest archiving algorithms. Compression in RLE occurs by replacing strings of identical bytes with counter-value pairs. (“Red, red, ..., red” is written as “N red”).

One of the implementations of the algorithm is as follows: they search for the least frequent byte, call it a prefix, and replace strings of identical characters with triples "prefix, counter, value". If this byte occurs in the source file once or twice in a row, then it is replaced with a pair of "prefix, 1" or "prefix, 2". There is one unused pair "prefix, 0" that can be used as a sign of the end of packed data.
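A sketch of this prefix-byte variant in Python, slightly simplified: runs are always emitted as full "prefix, counter, value" triples, and any occurrence of the prefix byte itself is escaped the same way (the pair-based escape and the end-of-data marker described above are omitted):

    PREFIX = 0x00   # assumption: in practice the least frequent byte would be chosen

    def rle_encode(data: bytes) -> bytes:
        out, i = bytearray(), 0
        while i < len(data):
            value, run = data[i], 1
            while i + run < len(data) and data[i + run] == value and run < 255:
                run += 1
            if run >= 3 or value == PREFIX:
                out += bytes([PREFIX, run, value])        # triple: prefix, counter, value
            else:
                out += bytes([value] * run)               # short runs stay literal
            i += run
        return bytes(out)

    def rle_decode(data: bytes) -> bytes:
        out, i = bytearray(), 0
        while i < len(data):
            if data[i] == PREFIX:
                count, value = data[i + 1], data[i + 2]
                out += bytes([value] * count)
                i += 3
            else:
                out.append(data[i])
                i += 1
        return bytes(out)

    raw = b"\x07" * 20 + b"ABC" + b"\x00\x00"
    packed = rle_encode(raw)
    assert rle_decode(packed) == raw
    print(len(raw), "->", len(packed), "bytes")           # 25 -> 9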

When encoding exe files, you can search and pack sequences of the form AxAyAzAwAt ..., which are often found in resources (strings encoded in Unicode)

The positive aspects of the algorithm are that it requires no additional memory during operation and is executed quickly. The algorithm is used in the PCX, TIFF and BMP formats. An interesting feature of run-length coding in PCX is that the degree of compression for some images can be increased significantly simply by changing the order of the colors in the image palette.

The LZW code (Lempel-Ziv & Welch) is by far one of the most common lossless compression codes. It is with the help of the LZW-code that compression in such graphic formats as TIFF and GIF is carried out, with the help of LZW modifications, many universal archivers perform their functions. The operation of the algorithm is based on searching the input file for repeated sequences of characters, which are encoded with combinations of 8 to 12 bits in length. Thus, this algorithm is most effective on text files and graphic files, which have large monochrome areas or repeating sequences of pixels.
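A minimal sketch of the dictionary-building step of LZW in Python; the output is left as a list of dictionary indices, whereas a real codec, as noted above, would pack them into codes 8 to 12 bits long:

    # Minimal LZW compression sketch: the dictionary is seeded with all single
    # bytes and then grows with ever longer repeated strings.
    def lzw_compress(data: bytes):
        dictionary = {bytes([i]): i for i in range(256)}
        result, current = [], b""
        for byte in data:
            candidate = current + bytes([byte])
            if candidate in dictionary:
                current = candidate
            else:
                result.append(dictionary[current])
                dictionary[candidate] = len(dictionary)
                current = bytes([byte])
        if current:
            result.append(dictionary[current])
        return result

    codes = lzw_compress(b"TOBEORNOTTOBEORTOBEORNOT")
    print(len(codes), "codes instead of 24 bytes")        # repeated strings collapse into single codes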

The absence of information loss during LZW coding led to the widespread use of the TIFF format based on it. This format does not impose any restrictions on the size and color depth of the image and is widely used, for example, in the printing industry. Another LZW-based format - GIF - is more primitive - it allows you to store images with a color depth of no more than 8 bits / pixel. At the beginning of the GIF file there is a palette - a table that establishes a correspondence between a color index - a number in the range from 0 to 255 and a true, 24-bit color value.

Lossy compression algorithms

The JPEG algorithm was developed by a group of firms called the Joint Photographic Experts Group. The goal of the project was to create a highly efficient compression standard for both black and white and color images, and this goal was achieved by the developers. Currently, JPEG is widely used where a high compression ratio is required - for example, on the Internet.

Unlike LZW, JPEG encoding is lossy. The encoding algorithm itself is based on rather complex mathematics, but in general terms it can be described as follows: the image is divided into squares of 8 × 8 pixels, and each square is converted into a sequential chain of 64 pixels. Each such chain is then subjected to the so-called DCT transform, one of the varieties of the discrete Fourier transform. Its essence is that the input sequence of pixels can be represented as a sum of sine and cosine components with multiple frequencies (so-called harmonics), and only the amplitudes of these components need to be known in order to reconstruct the input sequence with sufficient accuracy. The more harmonic components we keep, the smaller the discrepancy between the original and the compressed image. Most JPEG encoders allow you to adjust the compression ratio, and this is achieved in a very simple way: the higher the compression ratio, the fewer harmonics are used to represent each 64-pixel block.
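To make the "keep fewer harmonics" idea concrete, here is a small Python illustration on a single 8-sample block using a one-dimensional DCT. It is not the actual JPEG pipeline (no 8×8 blocks, quantization tables or entropy coding), and the sample values are invented for the example:

    import math

    N = 8

    def dct(block):                                   # orthonormal DCT-II
        out = []
        for k in range(N):
            scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
            out.append(scale * sum(block[n] * math.cos(math.pi * (n + 0.5) * k / N)
                                   for n in range(N)))
        return out

    def idct(coeffs):                                 # inverse transform (orthonormal DCT-III)
        out = []
        for n in range(N):
            out.append(sum(coeffs[k]
                           * (math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N))
                           * math.cos(math.pi * (n + 0.5) * k / N)
                           for k in range(N)))
        return out

    row = [52, 55, 61, 66, 70, 61, 64, 73]            # assumed example pixel row
    coeffs = dct(row)
    kept = coeffs[:3] + [0.0] * (N - 3)               # "stronger compression": keep only 3 harmonics
    approx = idct(kept)
    print([round(v, 1) for v in approx])              # roughly follows the original row; fine detail is lost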

Of course, the strong point of this type of coding is the high compression ratio while maintaining the original color depth. It is this property that has determined its widespread use in the Internet, where reducing the size of files is of paramount importance, in multimedia encyclopedias, where it is required to store the largest possible amount of graphics in a limited amount.

A negative property of this format is that it degrades the image quality, which cannot be removed by any means. It is this sad fact that does not allow its use in the printing industry, where quality is at the forefront.

However, the JPEG format is not the last word in the effort to reduce the size of the final file. Recently, intensive research has been carried out in the field of the so-called wavelet (or burst) transform. Based on quite complex mathematical principles, wavelet encoders make it possible to obtain more compression than JPEG, with less loss of information. Despite the mathematical complexity of the wavelet transform, its software implementation is simpler than JPEG's. Although wavelet compression algorithms are still at an early stage of development, they have a great future.

Fractal compression

Fractal image compression is a lossy image compression algorithm based on applying systems of iterated functions (IFS, usually affine transformations) to images. The algorithm is known for the fact that in some cases it allows very high compression ratios to be obtained (the best examples reach 1000 times at acceptable visual quality) for real photographs of natural objects, something unattainable in principle for other image compression algorithms. Because of a difficult patent situation, the algorithm never came into wide use.

Fractal archiving is based on the fact that using the coefficients of the system of iterated functions, the image is presented in a more compact form. Before looking at the archiving process, let's take a look at how IFS builds an image.

Strictly speaking, IFS is a set of three-dimensional affine transformations that transform one image into another. Points in three-dimensional space (x coordinate, y coordinate, brightness) are transformed.

The basis of the fractal coding method is the detection of self-similar areas in the image. The possibility of applying the theory of iterated function systems (IFS) to the problem of image compression was first explored by Michael Barnsley and Alan Sloan, who patented their idea in 1990 and 1991. Jacquin introduced a fractal coding method that uses a system of domain and range subimage blocks - square blocks covering the entire image. This approach became the basis for most fractal coding methods in use today. It was refined by Yuval Fisher and a number of other researchers.

This method splits the image into a set of non-overlapping range subimages and defines a set of overlapping domain subimages. For each range block, the coding algorithm finds the most suitable domain block and an affine transformation that maps this domain block into the given range block. The structure of the image is mapped into a system of range blocks, domain blocks and transformations.

The idea is as follows: suppose the original image is the fixed point of some contraction mapping. Then, instead of the image itself, one can store this mapping, and to restore the image it is enough to apply the mapping repeatedly to any starting image.

By Banach's theorem, such iterations always lead to a fixed point, that is, to the original image. In practice, the whole difficulty lies in finding the most suitable compression mapping from the image and in its compact storage. Typically, display search algorithms (that is, compression algorithms) are largely exhaustive and computationally intensive. At the same time, the recovery algorithms are quite efficient and fast.

Briefly, the method proposed by Barnsley can be described as follows. The image is encoded with several simple transformations (in our case, affine ones), that is, it is determined by the coefficients of these transformations (in our case, A, B, C, D, E, F).

For example, the image of the Koch curve can be encoded with four affine transformations, we will uniquely define it using only 24 coefficients.

As a result, the point will necessarily go somewhere inside the black area in the original image. Having performed this operation many times, we will fill in all the black space, thereby restoring the picture.

The best known IFS images are two: the Sierpinski triangle and the Barnsley fern. The first is defined by three affine transformations, the second by five (or, in our terminology, "lenses"). Each transformation is specified by just a few bytes, while the image built with their help can take up several megabytes.
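A small Python sketch of this idea: the three affine transformations below (a standard choice, assumed here for illustration) describe the Sierpinski triangle, and iterating them from an arbitrary starting point "decodes" the image:

    import random

    # Sketch of how an IFS "decodes" an image: three affine contractions fully
    # describe the Sierpinski triangle; iterating them from any starting point
    # fills in the attractor (printed here as a coarse character grid).
    TRANSFORMS = [                     # (a, b, c, d, e, f): x' = a*x + b*y + e, y' = c*x + d*y + f
        (0.5, 0.0, 0.0, 0.5, 0.0, 0.0),
        (0.5, 0.0, 0.0, 0.5, 0.5, 0.0),
        (0.5, 0.0, 0.0, 0.5, 0.25, 0.5),
    ]

    def render(width=60, height=30, iterations=20000):
        grid = [[' '] * width for _ in range(height)]
        x, y = random.random(), random.random()
        for i in range(iterations):
            a, b, c, d, e, f = random.choice(TRANSFORMS)
            x, y = a * x + b * y + e, c * x + d * y + f
            if i > 20:                                    # skip the first points, not yet on the attractor
                grid[int((1 - y) * (height - 1))][int(x * (width - 1))] = '#'
        return '\n'.join(''.join(row) for row in grid)

    print(render())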

It becomes clear how the archiver works and why it takes so long. In fact, fractal compression is a search for self-similar regions in an image and defining parameters of affine transformations for them.

In the worst case, if the optimizing algorithm is not applied, an enumeration and comparison of all possible image fragments of different sizes will be required. Even for small images, taking discreteness into account, we get an astronomical number of options to be searched. Even a sharp narrowing of the transformation classes, for example, by scaling only a certain number of times, will not allow us to achieve an acceptable time. In addition, image quality is lost. The vast majority of research in the field of fractal compression is now aimed at reducing the archiving time required to obtain a high-quality image.

For the fractal compression algorithm, as for other lossy compression algorithms, the mechanisms for adjusting the compression ratio and the amount of loss are very important. A fairly large set of such methods has been developed to date. First, the number of transformations can be limited, thereby guaranteeing a compression ratio not below a fixed value. Second, one can require that in a situation where the difference between the processed fragment and its best approximation exceeds a certain threshold, the fragment is split further (several lenses are assigned to it). Third, fragments smaller than, say, four points can be forbidden from being split. By changing the threshold values and the priority of these conditions, the image compression ratio can be controlled very flexibly: from bit-for-bit fidelity to any compression ratio.

Comparison with JPEG

Today, the most common graphics archiving algorithm is JPEG. Let's compare it with fractal compression.

First, note that both algorithms operate on 8-bit (grayscale) and 24-bit full color images. Both are lossy compression algorithms and provide similar archiving ratios. Both the fractal algorithm and JPEG have the ability to increase the compression ratio by increasing the loss. Moreover, both algorithms parallelize very well.

The differences begin when we consider the time the algorithms need to compress and decompress. The fractal algorithm compresses hundreds or even thousands of times longer than JPEG, but unpacking the image is 5-10 times faster. Therefore, if an image will be compressed only once but transmitted over a network and unpacked many times, it is more profitable to use the fractal algorithm.

JPEG uses the decomposition of the image into cosine functions, so the loss in it (even with a given minimum loss) appears in waves and halos at the border of sharp color transitions. It is for this effect that they do not like to use it when compressing images that are prepared for high-quality printing: there this effect can become very noticeable.

The fractal algorithm eliminates this disadvantage. Moreover, when printing an image, each time you have to perform a scaling operation, since the raster (or ruling) of the printing device does not match the raster of the image. During the conversion, several unpleasant effects can also occur, which can be dealt with either by scaling the image programmatically (for cheap printing devices such as conventional laser and inkjet printers), or by equipping the print device with its own processor, hard drive and a set of image processing programs (for expensive photosetting machines). As you might guess, when using the fractal algorithm, there are practically no such problems.

JPEG will not be displaced from widespread use by the fractal algorithm any time soon (if only because of the latter's low compression speed), but in multimedia applications and computer games its use is quite justified.

Lesson summary in informatics and ICT

Type: lesson in learning new material

Topic: Archivers. Information compression methods.

Goals :

    Explore information compression techniques (packing and Huffman)

    Develop algorithmic thinking

    Fostering a responsible attitude towards completing the assignment.

Method: Explanatory and illustrative

During the classes:

    Organizational moment (2 min)

    Knowledge update. (5 minutes)

    Explanation of the material and writing in the notebook. (25 minutes)

    Initial consolidation of the material (10 min)

    Summarizing. (3 min)

Questions on updating knowledge:

    How do you understand the concept of "information compression"?

    How is digital information compressed?

    What archiver programs do you know?

    What kinds of information require mandatory compression?

Theoretical material:

In the life of every PC user, situations regularly arise when, for example, you need to transfer one or several files to another computer using removable media of limited size, send a large file by e-mail, and so on. As a rule, this raises the problem of splitting a large file into several smaller files with the possibility of reassembling it later, grouping a large number of small files into larger ones, compressing files to reduce their size, etc. Archivers are used to solve such problems.

Archivers are programs that allow you to create, using special compression methods, copies of files of a smaller size and combine copies of several files into one archive file, as well as unpack archives (extract files from an archive).

There are various algorithms for archiving data without losing information, i.e. when unpacking, the data will be restored in its original form. For Windows, the most popular archivers are WinRAR, WinZIP, WinACE.

Compression of information is the process of converting the information stored in a file so that its redundancy is reduced and, accordingly, less memory is required to store it.

Compression of information in files is performed by eliminating redundancy in various ways, for example, by simplifying codes, eliminating constant bits from them, or representing repeating symbols or a repeating sequence of symbols in the form of a repetition factor and corresponding symbols. Various algorithms for such information compression are used.

Packing method

The input text "BELL_BALL_BALL" contains a total of 5 different characters (K, O, L, A, _). Therefore, each symbol can be encoded with three bits. There are 18 characters in total in the original test, so 18X3 = 54 bits are required. Compression ratio is 144/54 = 2.7

Huffman method.

The weak point of the packing method is that the symbols are encoded with bit sequences of the same length. For example, any text containing only the letters "A" and "B" is compressed by the packing method eight times. However, if just one more letter, say "C", is added to such a text, the compression ratio immediately drops by half - regardless of the length of the text and the number of added "C" characters.

Improvements to the compression ratio can be achieved by encoding common symbols in short codes, and rare ones in longer ones. This is exactly the idea behind the method published by D. Huffman in 1952.

Huffman algorithm

    The characters in the input alphabet form a list of free nodes. Each node has a weight equal to the number of occurrences of a character in the original message.

    Two free nodes with the smallest weights are selected from the list.

    Their "parent" node is created with a weight equal to the sum of their weights; it is connected to the "children" by arcs.

    One arc leaving the "parent" is assigned bit 1, the other - bit 0.

    The parent is added to the free list, and two of its children are removed from this list.

    The steps starting from the second are repeated until only one free node remains in the free list. It is considered the root of the tree.

Assignment

Pack the message using both methods: Arkhip_osip._Osip_ocryp.

Questions to summarize the lesson:

    What is information compression?

    The main purpose of archiver programs.

    What compression methods have we learned today?

    What is the most efficient compression method and why?



Plan:

    Introduction
  • 1 Data compression principles
  • 2 Characteristics of compression algorithms and their applicability
    • 2.1 Compression ratio
    • 2.2 Loss tolerance
    • 2.3 Algorithm system requirements
  • 3 Data compression algorithms of unknown format

Introduction

Data compression (English: data compression) is an algorithmic transformation of data performed in order to reduce its volume. It is used for more rational use of data storage and transmission devices. Synonyms: data packing, compression, compressive encoding, source encoding. The reverse procedure is called data recovery (unpacking, decompression).

Compression is based on eliminating redundancy in the original data. The simplest example of redundancy is the repetition of fragments in a text (for example, words of a natural or machine language). Such redundancy is usually eliminated by replacing a repeated sequence with a reference to an already encoded fragment together with its length. Another kind of redundancy is due to the fact that some values in the data being compressed occur more often than others. The amount of data is then reduced by replacing frequently occurring data with short codewords and rare data with long ones (entropy coding). Compressing data that has no redundancy (for example, a random signal or noise, or encrypted messages) is fundamentally impossible without loss.


1. Principles of data compression

Any compression method is based on the data source model, or more precisely, the redundancy model. In other words, data compression uses some a priori information about what kind of data is being compressed. Without such knowledge of the source, it is impossible to make any assumptions about the transformation that would reduce the size of the message. The redundancy model can be static, unchanged for the entire message being compressed, or it can be built or parameterized during the compression (and recovery) stage. Methods that allow changing the information redundancy model based on input data are called adaptive. Non-adaptive are usually highly specialized algorithms used to work with data with well-defined and unchanging characteristics. The overwhelming majority of fairly universal algorithms are adaptive to one degree or another.

All data compression methods fall into two main classes:

  • Lossless compression
  • Lossy compression

When using lossless compression, full recovery of the original data is possible; lossy compression allows you to recover data with distortions, which are usually insignificant from the point of view of the further use of the recovered data. Lossless compression is usually used to transfer and store text data, computer programs, less often - to reduce the volume of audio and video data, digital photographs, etc., in cases where distortion is unacceptable or undesirable. Lossy compression, which is significantly more efficient than lossless compression, is usually used to reduce the volume of audio and video data and digital photographs in cases where such reduction is a priority, and full compliance of the original and recovered data is not required.


2. Characteristics of compression algorithms and their applicability

2.1. Compression ratio

Compression ratio is the main characteristic of a compression algorithm. It is defined as the ratio of the volume of the original uncompressed data to the volume of compressed data, that is:

k = S_o / S_c,

where k is the compression ratio, S_o is the volume of the original data, and S_c is the volume of the compressed data. Thus, the higher the compression ratio, the more efficient the algorithm. It should be noted that:

  • if k = 1, the algorithm does not compress, i.e. the output message is equal in volume to the input;
  • if k < 1, the algorithm produces a message larger than the uncompressed one, i.e. it does "harmful" work.

The situation k < 1 is entirely possible during compression. It is fundamentally impossible to obtain a lossless compression algorithm that would produce, for any input data, output data of smaller or equal length. The justification of this fact is that since the number of different messages of length n bits is exactly 2^n, the number of different messages of length less than or equal to n (given that at least one message has a shorter length) will be less than 2^n. This means that it is impossible to map all the original messages unambiguously to compressed ones: either some original messages will have no compressed representation, or several original messages will correspond to the same compressed one, and then they cannot be distinguished.
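A quick counting check of this argument (illustrative only):

    # There are 2**n messages of length n bits, but fewer distinct shorter bit
    # strings, so no lossless compressor can map every n-bit message to a
    # strictly shorter output.
    n = 4
    messages = 2 ** n                                   # 16 inputs of length n
    shorter = sum(2 ** k for k in range(n))             # lengths 0 .. n-1: 2**n - 1 strings
    print(messages, "inputs vs", shorter, "possible shorter outputs")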

The compression ratio can be either constant (some algorithms for compressing sound, images, etc., for example, A-law, μ-law, ADPCM, truncated block coding), or variable. In the second case, it can be determined either for each specific message, or assessed according to some criteria:

  • average (usually over some test dataset);
  • maximum (the case of the best compression);
  • minimal (worst-case compression);

and so on. The compression ratio of lossy algorithms strongly depends in this case on the permissible compression error or quality, which usually acts as a parameter of the algorithm. In general, a constant compression ratio can only be provided by lossy data compression methods.


2.2. Loss tolerance

The main criterion for distinguishing between compression algorithms is the presence or absence of losses described above. In general, lossless compression algorithms are universal in the sense that their use is certainly possible for any type of data, while the possibility of using lossy compression should be justified. For some types of data, distortions are generally not acceptable. Among them

  • symbolic data, a change in which inevitably leads to a change in their semantics: programs and their source code, binary arrays, etc.;
  • vital data, changes in which can lead to critical errors: for example, data obtained from medical measuring equipment or from the control devices of aircraft, spacecraft, etc.;
  • intermediate data repeatedly subjected to compression and restoration during multi-stage processing of graphic, audio and video data.

2.3. Algorithm system requirements

Different algorithms may require a different amount of computing system resources on which they are implemented:

  • random access memory (for intermediate data);
  • permanent memory (for program code and constants);
  • processor time.

In general, these requirements depend on the complexity and "intelligence" of the algorithm. The general trend is that the more efficient and versatile an algorithm is, the greater the demands it places on computational resources. Nevertheless, in specific cases simple and compact algorithms can perform as well as complex and universal ones. The system requirements determine an algorithm's consumer qualities: the less demanding the algorithm, the simpler, more compact, more reliable and cheaper the system on which it can be implemented.

Since compression and recovery algorithms work in tandem, the ratio of system requirements to them matters. It is often possible by complicating one algorithm to greatly simplify another. Thus, three options are possible:

  • The compression algorithm requires more computational resources than the recovery algorithm. This is the most common relationship, typical for cases where data compressed once will be used many times. Examples include digital audio and video players.
  • The compression and recovery algorithms require approximately equal computational resources. This is the most acceptable option for communication lines, where compression and recovery each occur once at the two ends (for example, in digital telephony).
  • The compression algorithm is significantly less demanding than the recovery algorithm. This situation is typical for cases where the compression procedure is implemented by a simple, often portable device with a critically limited amount of resources, for example a spacecraft or a large distributed network of sensors. It can also apply to data that needs to be unpacked in only a very small percentage of cases, for example recordings from a video surveillance camera.

3. Algorithms for compressing data of unknown format

There are two main approaches to compressing unknown format data.

  • At each step of the compression algorithm, the next compressed character is either placed in the output buffer of the compressing encoder as it is (with a special flag indicating that it has not been compressed), or a group of several compressed characters is replaced with a link to a group of already encoded characters that matches it. Because recovering data compressed in this way is very fast, this approach is often used to create self-extracting programs.
  • For each sequence of characters being compressed, statistics of its occurrence in the encoded data is collected once or at each moment of time. Based on this statistics, the probability of the value of the next encoded character (or sequence of characters) is calculated. After that, one or another kind of entropy coding is applied, for example, arithmetic coding or Huffman coding, to represent frequently occurring sequences in short codewords, and rarely occurring ones in longer ones.


Slide 2

  • For long-term storage of data on various storage media
  • For data transmission via communication channels

Slide 3

    Redundancy of data

    • Most of the data is redundant
    • Redundancy improves the perception and processing of information
    • During storage, redundancy is reduced
    • Video information has the greatest redundancy, followed by graphic, audio, and the lowest redundancy in text information

Slide 4

    Compression methods

    • With partial loss of information: used when compressing image, video and sound code. This possibility is associated with the subjective characteristics of human vision and hearing.
    • Without loss of information: use of non-uniform (variable-length) character codes; identification of repeating code fragments.

Slide 5

    With partial loss

    • Vision is more sensitive to the brightness of a pixel than to its color, so the volume of video code can be reduced by storing color codes not for every pixel but only for every second, third, etc. pixel of the raster. The larger the gaps, the more the video data is compressed, but the image quality deteriorates.
    • When encoding video films - dynamic images - the persistence of vision is taken into account. Rapidly changing portions of a film can be encoded in less detail than still frames.
    • Audio code is the most difficult to compress. Here, too, the psychophysiological characteristics of human hearing are used: it is taken into account which harmonics of natural sound our hearing is more sensitive to and which less, and the poorly perceived harmonics are filtered out by mathematical processing. Compression is also helped by taking into account the nonlinear relationship between the amplitude of sound vibrations and our perception of loudness.

Slide 6

    • It is used for such data types for which the formal loss of a part of the content does not lead to loss of consumer properties and provides a high compression ratio.
    • Examples: MPG video, MP3 audio, JPG drawings.

Slide 7

    No loss - "reversible"

    • Applies to texts, databases, and all the other types mentioned above.
    • Example: pictures - GIF, TIF, PCX, video - AVI, any data type - ZIP, ARJ, RAR, etc.

Slide 8

    Archives

    • Archive is a file containing one or several files in a compressed form.
    • The extension of the archive file depends on the archiver program.
    • Archivers are programs for creating and reading archives. Examples: WinRar, WinZip, WinArj.

Slide 9

    Archives are used in order to:

    • increase the efficiency of the medium - place more information on one medium
    • creating backup copies of valuable data, which will be stored in a compressed form on separate media.
    • protect data from unauthorized access with a password - documents won't even open
    • increasing the speed of copying data from disk to disk, for example, electronic pages containing many small graphic files
    • fast recovery of data modified by the user
    • transmission of information through communication channels
    • data fragmentation into packages

Slide 10

    Capabilities of archivers

  • Viewing archive contents
  • Data integrity control
  • Unpacking the archive
  • Repairing a damaged archive
  • Setting protection
  • Adding a file to the archive
  • Creation of multivolume archives
  • Creating self-extracting archives
  • Blocking against accidental modification

Slide 11

    Self-extracting

    (SFX, from English SelF-eXtracting) is an archive to which an executable module is attached. This module allows you to extract files by simply running the archive like a normal program. Thus, no additional external programs are required to extract the contents of the SFX archive. SFX archives are useful when you need to transfer an archive to someone, but you are not sure that the recipient has the appropriate archiver to unpack it.
