
Two ways to compress information in files. Compression by run-length (series) encoding. Email transmission

Good day.
Today I want to touch on the topic of lossless data compression. Although there have already been articles on Habr devoted to some particular algorithms, I wanted to cover the subject in a little more detail.
I will try to give both a mathematical description and a description in plain terms, so that everyone can find something interesting for themselves.

In this article, I will touch on the fundamentals of compression and the main types of algorithms.

Compression. Is it necessary in our time?

Of course, yes. Of course, we all understand that now we have access to both high-volume storage media and high-speed data transmission channels. However, at the same time, the volumes of transmitted information are growing. If a few years ago we watched 700-megabyte films that fit on one disc, today films in HD quality can take tens of gigabytes.
Of course, there is little point in compressing everything indiscriminately, but there are still situations in which compression is extremely useful, if not necessary.

  • Sending documents by e-mail (especially large volumes of documents from mobile devices)
  • Publishing documents on websites, where there is a need to save traffic
  • Saving disk space when replacing or adding storage is difficult, for example, when it is hard to get a budget approved for capital expenditures and there is not enough disk space

Of course, you can think of many more situations in which compression will be useful, but these few examples are enough for us.

All compression methods can be divided into two broad groups: lossy compression and lossless compression. Lossless compression is used when information needs to be restored with bit precision. This approach is the only one possible when compressing, for example, text data.
In some cases, however, exact recovery of the information is not required, and then algorithms that implement lossy compression may be used; unlike lossless compression, they are usually simpler to implement and provide a higher compression ratio.

So, let's move on to considering lossless compression algorithms.

General-purpose lossless compression methods

In general, there are three basic approaches on which compression algorithms are built.
The first group of methods is stream transformation. Here, new incoming uncompressed data is described in terms of the data that has already been processed. No probabilities are calculated; characters are encoded only on the basis of the data that has already been seen, as, for example, in the LZ methods (named after Abraham Lempel and Jacob Ziv). In this case, the second and subsequent occurrences of a substring already known to the encoder are replaced with references to its first occurrence.
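To make the idea of stream transformation more concrete, here is a minimal illustrative sketch in Python of an LZ77-style encoder (a simplified toy, not the code of any particular archiver): repeated substrings are replaced by (offset, length) references to their earlier occurrence in the already processed data, and literal characters are passed through. The window size and the greedy search are simplifications.

```python
def lz77_encode(data: str, window: int = 255):
    """Toy LZ77-style encoder: emits (offset, length, next_char) triples.

    Repeated substrings are replaced by references to their earlier
    occurrence within a sliding window of already processed text.
    """
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        start = max(0, i - window)
        for cand in range(start, i):          # greedy search in the already seen window
            length = 0
            while (i + length < len(data)
                   and cand + length < i
                   and data[cand + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - cand, length
        nxt = data[i + best_len] if i + best_len < len(data) else ""
        out.append((best_off, best_len, nxt))
        i += best_len + 1
    return out


def lz77_decode(tokens):
    """Inverse transformation: expand the references back into text."""
    text = []
    for off, length, ch in tokens:
        for _ in range(length):
            text.append(text[-off])
        if ch:
            text.append(ch)
    return "".join(text)


assert lz77_decode(lz77_encode("abracadabra abracadabra")) == "abracadabra abracadabra"
```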

The second group consists of statistical compression methods. These, in turn, are divided into adaptive (or stream) methods and block methods.
In the first (adaptive) case, the probabilities for new data are calculated from the data already processed during encoding. These methods include the adaptive versions of the Huffman and Shannon-Fano algorithms.
In the second (block) case, the statistics of each data block are calculated separately and attached to the compressed block itself. This group includes the static versions of the Huffman, Shannon-Fano, and arithmetic coding methods.

The third group consists of so-called block transformation methods. The incoming data is split into blocks, which are then transformed as a whole. Some of these methods, especially those based on permuting blocks, may not lead to any significant (or even any) reduction in the amount of data. However, after such processing the data structure is significantly improved, and subsequent compression by other algorithms becomes more effective and faster.

General principles on which data compression is based

All data compression methods are based on a simple logical principle. If the most frequent elements are encoded with shorter codes and the less frequent ones with longer codes, then storing all the data takes less space than if every element were represented by a code of the same length.
The exact relationship between element frequencies and optimal code lengths is described by Shannon's source coding theorem, which defines the limit of lossless compression in terms of Shannon entropy.

A bit of math
If the probability of occurrence of an element s_i is p(s_i), then it is most advantageous to represent this element with −log2 p(s_i) bits. If during encoding the length of every element can be brought down to −log2 p(s_i) bits, then the length of the entire encoded sequence will be minimal over all possible encoding methods. Moreover, if the probability distribution of the elements F = {p(s_i)} is unchanged and the elements are mutually independent, then the average code length can be calculated as

H(F) = −∑ p(s_i) · log2 p(s_i).

This value is called the entropy of the probability distribution F, or the entropy of the source at a given moment in time.
However, in practice the probability of occurrence of an element is usually not independent; on the contrary, it depends on certain factors. In that case, for each newly encoded element s_i the probability distribution F takes some value F_k; that is, for each element F = F_k and H = H_k.

In other words, we can say that the source is in state k, which corresponds to a certain set of probabilities p_k(s_i) for all elements s_i.

Therefore, taking this amendment into account, the average code length can be expressed as

H = ∑_k P_k · H_k = −∑_k P_k ∑_i p_k(s_i) · log2 p_k(s_i),

where P_k is the probability of finding the source in state k.

So, at this stage we know that compression is based on replacing frequently occurring elements with short codes (and vice versa), and we also know how to determine the average code length. But what are codes and coding, and how are they done?

Memoryless encoding

Memoryless codes are the simplest codes on which data compression can be based. In a memoryless code, each character of the data vector being encoded is replaced by a codeword from a prefix set of binary sequences or words.
In my opinion, this is not the clearest definition. Let's consider this topic in a little more detail.

Let an alphabet Ψ be given, consisting of a certain (finite) number of letters. Each finite sequence of symbols from this alphabet (A = a_1 a_2 ... a_n) will be called a word, and the number n the length of that word.

Let a second (output) alphabet also be given. Similarly, we will denote a word in this alphabet by B.

Let us also introduce notation for the sets of all non-empty words: one for the set of non-empty words over the first alphabet, and one for the set of non-empty words over the second.

Let a mapping F also be given that assigns to each word A over the first alphabet some word B = F(A) over the second. Then B is called the code of the word A, and the transition from the original word to its code is called coding.

Since a word can also consist of a single letter, we can establish a correspondence between the letters of the first alphabet and the corresponding words of the second:
a_1 <-> B_1
a_2 <-> B_2
...
a_n <-> B_n

This correspondence is called a scheme and is denoted by Σ.
In this case, the words B_1, B_2, ..., B_n are called elementary codes, and this kind of encoding is called alphabetic coding. Most of us have come across this kind of coding, even without knowing everything described above.

So, we have defined the concepts of alphabet, word, code, and coding. Now let us introduce the concept of a prefix.

Let a word B have the form B = B′B″. Then B′ is called the beginning, or prefix, of the word B, and B″ is its end. This is a fairly simple definition, but note that for any word B, both the empty word ʌ and the word B itself can be considered a beginning and an end.

So, we are close to understanding the definition of memoryless codes. The last definition we need is that of a prefix set. A scheme Σ has the prefix property if for any 1 ≤ i, j ≤ r, i ≠ j, the word B_i is not a prefix of the word B_j.
Simply put, a prefix set is a finite set in which no element is a prefix (beginning) of any other element. A simple example of such a set is an ordinary alphabet.

So, we have covered the basic definitions. How, then, does memoryless encoding itself work?
It takes place in three stages.

  1. An alphabet Ψ of the symbols of the original message is compiled, and its symbols are sorted in descending order of their probability of occurrence in the message.
  2. Each symbol a_i from the alphabet Ψ is associated with a certain word B_i from the prefix set Ω.
  3. Each symbol is encoded, and the codes are concatenated into one data stream, which is the result of the compression.

One of the canonical algorithms that illustrate this method is the Huffman algorithm.

Huffman algorithm

The Huffman algorithm uses the frequencies of occurrence of identical bytes in the input data block and maps frequently occurring bytes to shorter bit strings, and vice versa. This code is minimally redundant. Let us consider the case when, regardless of the input stream, the alphabet of the output stream consists of only two characters: zero and one.

First of all, when coding with the Huffman algorithm, we need to build the scheme Σ. This is done as follows:

  1. All letters of the input alphabet are ordered in descending order of probability. All words of the output alphabet (that is, what we are going to encode into) are initially considered empty (recall that the output alphabet consists only of the characters {0, 1}).
  2. The two symbols a_(j-1) and a_j of the input stream that have the lowest probabilities of occurrence are combined into one "pseudo-symbol" with probability p equal to the sum of the probabilities of the symbols it contains. Then we prepend 0 to the word B_(j-1) and 1 to the word B_j, which will subsequently be the codes of the symbols a_(j-1) and a_j, respectively.
  3. We remove these symbols from the alphabet of the original message and add the resulting pseudo-symbol to this alphabet (of course, it must be inserted into the alphabet at the right place, taking its probability into account).
Steps 2 and 3 are repeated until only one pseudo-symbol remains in the alphabet, containing all the original symbols. Since at each step and for each symbol the corresponding word B_i changes (by having a one or a zero prepended), after this procedure completes, every initial symbol a_i of the alphabet will have a certain code B_i.
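To illustrate the procedure just described, here is a short sketch of classic Huffman code construction in Python using a priority queue (the heapq module). It is only an illustration of the steps above: the exact bit patterns depend on how ties are broken and on which branch receives 0 or 1, but the resulting code lengths match the theory.

```python
import heapq


def huffman_codes(probabilities: dict) -> dict:
    """Build a Huffman code table from a {symbol: probability} mapping.

    Repeatedly merges the two least probable (pseudo-)symbols, prepending
    '0' to the codes of one group and '1' to the codes of the other.
    """
    # heap entries: (probability, tie_breaker, {symbol: partial_code})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate single-symbol alphabet
        return {sym: "0" for sym in probabilities}
    tie = len(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)  # the two lowest probabilities
        p2, _, group2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in group1.items()}
        merged.update({s: "1" + c for s, c in group2.items()})
        tie += 1
        heapq.heappush(heap, (p1 + p2, tie, merged))
    return heap[0][2]


# The alphabet from the example below: probabilities 0.5, 0.24, 0.15, 0.11
print(huffman_codes({"a1": 0.5, "a2": 0.24, "a3": 0.15, "a4": 0.11}))
# codeword lengths come out as 1, 2, 3, 3; the exact bits depend on tie-breaking
```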

For a better illustration, consider a small example.
Suppose we have an alphabet consisting of only four symbols: (a_1, a_2, a_3, a_4). Let us also assume that the probabilities of occurrence of these symbols are, respectively, p_1 = 0.5, p_2 = 0.24, p_3 = 0.15, p_4 = 0.11 (the sum of all probabilities is obviously equal to one).

So, let's build a scheme for a given alphabet.

  1. We combine the two symbols with the lowest probabilities (0.11 and 0.15) into a pseudo-symbol p′ with probability 0.26.
  2. We combine the two symbols with the lowest probabilities (0.24 and 0.26) into a pseudo-symbol p″ with probability 0.5.
  3. We remove the combined symbols and insert the resulting pseudo-symbol into the alphabet.
  4. Finally, we combine the remaining two symbols and obtain the root of the tree.

Illustrating this process yields a binary code tree, built from the leaves up, in which each merge step creates a new internal node.


As you can see, each time we merge, we assign codes 0 and 1 to the merged characters.
This way, when the tree is built, we can easily get the code for each character. In our case, the codes will look like this:

a_1 = 0
a_2 = 11
a_3 = 100
a_4 = 101

Since none of these codes is a prefix of any other (that is, we got the notorious prefix set), we can uniquely identify each code in the output stream.
So, we have achieved that the most frequent character is encoded by the shortest code, and vice versa.
If we assume that initially one byte was used to store each character, then we can calculate how much we managed to reduce the data.

Let us assume that at the input we had a string of 1000 characters, in which the symbol a_1 occurred 500 times, a_2 240 times, a_3 150 times, and a_4 110 times.

Initially, this string occupied 8000 bits. After encoding, we get a string of length ∑ n_i·l_i = 500·1 + 240·2 + 150·3 + 110·3 = 1760 bits, where n_i is the number of occurrences of symbol a_i and l_i is the length of its code. So we managed to compress the data by about 4.54 times, spending on average 1.76 bits to encode each character of the stream.

Recall that, according to Shannon, the theoretical lower bound on the average code length is the entropy H = −∑ p_i · log2 p_i. Substituting our probabilities into this formula, we get an average code length of approximately 1.755 bits, which is very, very close to our result.
However, it should be borne in mind that in addition to the data itself we need to store the encoding table, which will slightly increase the final size of the encoded data. Obviously, different variations of the algorithm can be used in different cases: sometimes it is more efficient to use a predefined probability table, and sometimes it is necessary to build it dynamically by making a pass over the data being compressed.
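For the skeptical reader, here is a small self-contained check of the numbers from this example, using only Python's standard math module:

```python
import math

probs = [0.5, 0.24, 0.15, 0.11]   # symbol probabilities from the example
lengths = [1, 2, 3, 3]            # Huffman code lengths obtained above
counts = [500, 240, 150, 110]     # occurrences in the 1000-character string

entropy = -sum(p * math.log2(p) for p in probs)
total_bits = sum(c * l for c, l in zip(counts, lengths))

print(f"entropy        = {entropy:.5f} bits per symbol")           # ~1.75497
print(f"average length = {total_bits / 1000} bits per symbol")     # 1.76
print(f"encoded size   = {total_bits} bits, ratio = {8000 / total_bits:.3f}")  # 1760 bits, ~4.545
```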

Conclusion

So, in this article I have tried to describe the general principles behind lossless compression, and I have also examined one of the canonical algorithms: Huffman coding.
If the Habr community likes the article, I will gladly write a sequel, since there are still many interesting things related to lossless compression: classical algorithms, preliminary data transformations (for example, the Burrows-Wheeler transform), and, of course, specific algorithms for compressing audio, video, and images (the most interesting topic, in my opinion).


Information compression principles

Any information compression method is based on a model of the information source, or, more precisely, on a redundancy model. In other words, compressing information relies on some prior knowledge of what kind of data is being compressed; without knowing anything about the nature of the data, one cannot make any assumptions about which transformation will reduce the size of the message. This knowledge is used both during compression and during decompression. The redundancy model can also be built or parameterized during the compression stage. Methods that adjust the redundancy model based on the input data are called adaptive. Non-adaptive methods are usually highly specialized algorithms used for data with well-defined and unchanging characteristics. The overwhelming majority of reasonably universal algorithms are adaptive to one degree or another.

Any information compression method includes two mutually inverse transformations:

  • the compression transform;
  • the expansion (decompression) transform.

The compression transform produces a compressed message from the original one. The expansion transform recovers the original message, or an approximation of it, from the compressed one.

All compression methods are divided into two main classes:

  • lossless,
  • lossy.

The fundamental difference between the two is that lossless compression makes it possible to reconstruct the original message exactly. Lossy compression yields only some approximation of the original message, that is, one that differs from the original but within predefined error bounds. These errors must be determined by another model, the receiver model, which specifies which data, and with what accuracy, must be delivered to the receiver and which data may be discarded.

Compression Algorithm Characteristics and Applicability

Compression ratio

Compression ratio is the main characteristic of a compression algorithm and expresses its main practical quality. It is defined as the ratio of the size of the uncompressed data to that of the compressed data, that is:

k = S_o / S_c,

where k is the compression ratio, S_o is the size of the uncompressed data, and S_c is the size of the compressed data. Thus, the higher the compression ratio, the better the algorithm. Note that:

  • if k = 1, the algorithm performs no compression, that is, the output message has the same size as the input;
  • if k < 1, the algorithm produces a larger message than the uncompressed one, that is, it does "harmful" work.

The situation k < 1 is entirely possible. It is impossible to construct a lossless compression algorithm that, for any input, produces output of smaller or equal length. The justification of this fact is that the number of distinct messages of length n bits is exactly 2^n, while the number of distinct shorter messages (of length less than n bits) is at most 2^n − 1. This means it is impossible to map all original messages one-to-one onto shorter compressed ones: either some original messages will have no compressed representation, or several original messages will map to the same compressed one and therefore cannot be distinguished.

The compression ratio can be either a constant ratio (some compression algorithms for sound, images, etc., for example, A-law, μ-law, ADPCM), or variable. In the second case, it can be determined either for a specific message, or assessed according to some criteria:

  • average (usually over some test dataset);
  • maximum (case of the best compression);
  • minimal (worst-case compression);

or by some other criterion. The compression ratio of lossy algorithms strongly depends on the permissible compression error, or quality, which usually acts as a parameter of the algorithm.
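As an illustration of how the compression ratio can be measured in practice, here is a small sketch using Python's standard zlib module (DEFLATE, that is, LZ77 plus Huffman coding); the test data is made up:

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """k = S_o / S_c for zlib's DEFLATE at its default settings."""
    return len(data) / len(zlib.compress(data))


# Highly redundant data compresses well ...
print(compression_ratio(b"abcabcabc" * 1000))    # k is well above 1
# ... while incompressible (e.g. random) data can even give k < 1
print(compression_ratio(os.urandom(10_000)))     # k close to, or slightly below, 1
```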

Loss tolerance

The main criterion for distinguishing between compression algorithms is the presence or absence of the losses described above. In general, lossless compression algorithms are universal in the sense that they can be applied to any type of data, whereas the use of lossy compression must be justified. Some types of data do not tolerate any loss at all:

  • symbolic data, whose modification inevitably changes their semantics: programs and their source code, binary arrays, etc.;
  • vital data, changes in which can lead to critical errors: for example, data obtained from medical measuring equipment or from the control devices of aircraft, spacecraft, etc.;
  • data repeatedly subjected to compression and decompression: working graphics, audio, and video files.

Lossy compression, however, makes it possible to achieve much higher compression ratios by discarding insignificant information that compresses poorly. For example, the lossless audio codec FLAC typically compresses audio by a factor of 1.5-2.5, whereas the lossy Vorbis codec, depending on the chosen quality setting, can compress up to 15 times while maintaining acceptable sound quality.

Algorithm system requirements

Different algorithms may require different amounts of resources from the computing system on which they run:

  • RAM (for intermediate data);
  • permanent memory (for program code and constants);
  • processor time.

In general, these requirements depend on the complexity and "intelligence" of the algorithm. The general trend is that the better and more versatile an algorithm is, the greater the demands it places on the machine. However, in specific cases simple and compact algorithms may perform better. System requirements determine an algorithm's consumer qualities: the less demanding the algorithm, the simpler, and therefore more compact, reliable, and cheap, the system it can run on.

Since compression and decompression algorithms work in pairs, the ratio of their system requirements also matters. It is often possible to simplify one algorithm considerably at the cost of complicating the other. Thus, three options are possible:

  • The compression algorithm is much more demanding on resources than the decompression algorithm. This is the most common relationship and applies mainly to cases where data is compressed once and used many times, for example in digital audio and video players.
  • The compression and decompression algorithms have roughly equal requirements. This is the most suitable option for a communication line, where compression and decompression each happen once at its two ends, for example in telephony.
  • The compression algorithm is significantly less demanding than the decompression algorithm. This is a rather exotic case. It applies when the transmitter is an ultra-portable device whose resources are very limited, for example a spacecraft or a large distributed network of sensors, or when the data needs to be unpacked only in a very small percentage of cases, for example recordings from CCTV cameras.


See what "Information Compression" is in other dictionaries:

    compression of information- consolidation of information - [L.G. Sumenko. The English Russian Dictionary of Information Technology. M .: GP TsNIIS, 2003.] Topics information technologies in general Synonyms information compaction EN information reduction ...

    COMPRESSING INFORMATION- (data compression) presentation of information (data) in fewer bits than the original. Based on eliminating redundancy. Distinguish S. and. without loss of information and with the loss of some information that is insignificant for the tasks being solved. TO… … Encyclopedic Dictionary of Psychology and Pedagogy

    adaptive lossless compression- - [L.G. Sumenko. The English Russian Dictionary of Information Technology. M .: GP TsNIIS, 2003.] Topics information technologies in general EN adaptive lossless data compressionALDC ... Technical translator's guide

    compression / compression of information- - [L.G. Sumenko. The English Russian Dictionary of Information Technology. M .: GP TsNIIS, 2003.] Topics information technologies in general EN compaction ... Technical translator's guide

    digital information compression- - [L.G. Sumenko. The English Russian Dictionary of Information Technology. M .: GP TsNIIS, 2003.] Topics information technologies in general EN compression ... Technical translator's guide

    The sound is a simple wave, and digital signal is a representation of this wave. This is achieved by memorizing the amplitude analog signal multiple times within one second. For example, in an ordinary CD, a signal is memorized 44100 times in ... ... Wikipedia

    A process that reduces the amount of data by reducing redundancy. Data compression involves compacting standard sized chunks of data. A distinction is made between lossy and lossless compression. In English: Data ... ... Financial vocabulary

    digital map compression- Processing of digital cartographic information in order to reduce its volume, including the elimination of redundancy within the required accuracy of its presentation. [GOST 28441 99] Topics digital cartography Generalizing terms methods and technologies ... ... Technical translator's guide

The purpose of the lesson: to develop attentiveness, intelligence, to foster interest in the subject.
Equipment: computers, laboratory disks, appropriate software, cards with a test task.

During the classes

1. Organizational part.
2. Updating basic knowledge.
3. Learning new material
4. Reinforcing the new material.
5. Homework.
6. Summing up the lesson.

Learning new material

1. What is archiving. Data compression concept.
2. The main types of archiving programs.
3. The WinRAR archiver program.
4. How to add a file to the archive, as well as extract it from the archive.

With the development of information technology, the question of how to store data became acute. Since the 1940s, scientists have been developing ways of representing data that use the space on storage media more economically. The result was data compression and data archiving (backup) technology.

Data archiving is the merging of several files or directories into a single archive file.

Data compression - reducing the size of the original files by eliminating redundant information.

These tasks are performed by archiving programs, which provide data compression, in particular the archiving of files. Using special algorithms, archivers remove all redundant information from files, and during the reverse unpacking operation they restore the information in its original form. The compressed file is two to ten times smaller than the original. In this case, the compression and restoration of information occur without loss. Lossless compression is relevant when working with text and program files and in cryptography tasks. There are also lossy compression techniques.
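As an illustration of these two operations (packing files into an archive and extracting them from it), here is a hedged sketch using Python's standard zipfile module rather than the WinRAR program discussed later in this lesson; the file names are hypothetical:

```python
import zipfile

# Pack files into a compressed archive (DEFLATE compression)
with zipfile.ZipFile("documents.zip", "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.write("report.txt")   # hypothetical file names
    archive.write("notes.txt")

# Extract everything from the archive into a separate folder
with zipfile.ZipFile("documents.zip") as archive:
    archive.extractall("restored")
```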

The compression ratio depends on the file type and on the archiver program. Text files compress the most, audio and video files the least.

Archiving files. Tasks

Until now we have been talking about one purpose of archiving data: more economical use of storage media. However, archiving can serve a whole range of tasks:
1. Reducing the volume of files (important not only to save space on the media, but also to quickly transfer files over the network).
2. Backing up to external media for storing important information.

3. Archiving when encrypting data in order to reduce the likelihood of breaking the cryptosystem.

The process of writing information to an archive file is called archiving.
Extracting files from the archive - unzipping.

The first archiving programs appeared in the mid-1980s. They were designed to work under MS-DOS and supported the popular archive formats ARC, ICE, ARJ, ZIP, RAR, and others. There was also a group of archivers that packed data into self-extracting archives, files with the .exe or .com extension. Resident archivers were created to compress an entire disk: they increased the efficiency of disk space usage by creating large archive files, so-called compressed disks.

Working with archives became much more convenient with the advent of Windows and Windows versions of the archivers. Of the older archive formats, ARJ and ZIP really took root among Windows users. Large archive files can be spread across several floppy disks (volumes); such archives are called multivolume.

A volume is a component part of a multivolume archive.

Dozens of archiving programs are in use today; they differ in their feature sets and operating parameters, but the best of them have roughly similar characteristics. Usually packing and unpacking are done by the same program, but in some cases different programs are used; for example, the PKZIP program packs files and PKUNZIP unpacks them.
Archiving programs can create archives that need no extra program to be extracted, because the archive file itself contains a self-extracting program. Such archives are called SFX archives.

Putting files into an archive: Start → Programs → WinRAR, or use the shortcut on the Desktop.

The universal archiver WinRAR

The WinRAR archiver is also intended for archiving files. It has a convenient graphical interface and supports drag-and-drop. WinRAR can work not only with RAR archives but also with other archive formats: ZIP, CAB, ARJ, LZH. WinRAR can be launched in any of the ways provided in Windows: from the main menu (Start → Programs → WinRAR → WinRAR) or via a shortcut on the desktop.

Test survey on the basics of working with disks.
Homework.
Lesson self-analysis.


Nowadays many users wonder how information is compressed in order to save free space on the hard drive, because compression is one of the most effective ways to use the available space on any drive. Quite often, modern users who face a shortage of free space have to delete data to free up the necessary room, while more advanced users resort to data compression to reduce its volume.

However, many do not even know what the process of compressing information is called, not to mention which algorithms are used and what the application of each of them gives.

Should you compress data?

Data compression is quite important today and is useful to any user. Of course, nowadays almost everyone can buy advanced storage devices that provide a large amount of free space and are served by high-speed data transmission channels.

However, one must understand that over time the amount of data to be transferred grows as well. While literally ten years ago a volume of 700 MB was considered standard for an ordinary film, today films in HD quality can occupy several tens of gigabytes, to say nothing of how much space high-quality Blu-ray films take.

When is data compression necessary?

Of course, you should not expect that the process of compressing information will bring you much benefit, but there are a number of situations in which some methods of compressing information are extremely useful and even necessary:

  • Transferring documents by email. This is especially relevant when large volumes of information have to be sent from mobile devices.
  • Publishing data on various websites when traffic needs to be saved;
  • Saving free space on the hard drive when it is not possible to replace or add storage media. In particular, this is the common situation when the available budget is limited but there is not enough free disk space.

Of course, beyond those listed above there is a huge number of other situations in which compressing information to reduce its volume may be required, but these are by far the most common.

How can data be compressed?

Today there is a wide variety of methods for compressing information, but they all fall into two main groups: lossy compression and lossless compression.

The latter group is relevant when the data must be recovered with extremely high accuracy, down to the bit. This approach is the only suitable one when, for example, a text document is being compressed.

At the same time, in some situations the most accurate recovery of the compressed data is not needed, so algorithms that compress the information on disk with certain losses may be used. The advantage of lossy compression is that such techniques are much easier to implement and also provide a higher degree of compression.

Lossy compression

Lossy compression provides considerably better compression while maintaining sufficient information quality. In most cases such algorithms are used to compress analog data such as images or sound. In these situations the unpacked files can differ quite noticeably from the original information, yet they are practically indistinguishable to the human eye or ear.

Lossless compression

Lossless compression algorithms provide exact data recovery, eliminating any loss in the compressed files. However, one must understand that in this case the compression is less effective.

Generic methods

In addition, there is a set of universal methods by which information is compressed effectively to reduce the space it occupies. In general, there are only three main approaches:

  • Stream transformation. New incoming uncompressed information is described in terms of the data already processed; no probabilities are calculated, and characters are encoded based solely on the data that has already undergone processing.
  • Statistical compression. These methods are divided into two subcategories: adaptive and block. The adaptive variant calculates the probabilities for new data from the information already processed during encoding; this includes the adaptive variants of the Shannon-Fano and Huffman algorithms. The block variant calculates the statistics of each block of data separately and attaches them to the compressed block.
  • Block transformation. The incoming information is divided into blocks, each of which is then transformed as a whole. Certain methods, especially those based on permuting blocks, may not by themselves lead to a significant reduction in the amount of data. However, after such processing the data structure improves considerably, so that subsequent compression by other algorithms is much easier and faster (a minimal sketch of one such transformation is shown after this list).
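As an example of a block transformation that does not shrink the data by itself but improves its structure for subsequent compression, here is a naive sketch of the Burrows-Wheeler transform mentioned earlier in this text (real implementations use suffix arrays rather than sorting all rotations):

```python
def bwt(block: str, end_marker: str = "\x00") -> str:
    """Naive Burrows-Wheeler transform of a single block.

    Sorts all cyclic rotations of the block and returns the column of last
    characters. The output is exactly as long as the input, but equal
    characters tend to cluster, which helps a later run-length or
    statistical coder.
    """
    s = block + end_marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


print(repr(bwt("banana")))   # 'annb\x00aa' -- the repeated letters are grouped together
```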

Compression during backup

One of the most important components of a backup is the storage device to which the user's information is moved. The more data you move, the larger the device you will need. However, if you compress the data, the problem of insufficient free space is unlikely to remain relevant for you.

Why is this needed?

Compressing information significantly reduces the time needed to copy the required files and at the same time saves free space on the drive. In other words, with compression the information is copied much more compactly and quickly, and you can save the money that would otherwise be needed to buy a larger drive. Among other things, compression also reduces the time needed to transfer all the data to a server or copy it over the network.

Compression of data for backup can be carried out into one or several files - in this case, everything will depend on which program you use and what information you compress.

When choosing a utility, be sure to check how well the chosen program compresses data. This depends on the type of information: text documents can shrink by more than 90 percent, while poorly compressible data may shrink by no more than 5 percent.

As mentioned above, one of the important tasks of preparing data for encryption is reducing its redundancy and evening out the statistical patterns of the language used. Partial elimination of redundancy is achieved by compressing the data.

Compression of information is the process of converting the original message from one code system to another, as a result of which the size of the message decreases. Algorithms designed for compression can be divided into two large groups: those that implement lossless (reversible) compression and those that implement lossy (irreversible) compression.

Reversible compression implies absolutely exact data recovery after decoding and can be applied to any information. It always reduces the volume of the output information stream without changing its information content, that is, without loss of the information structure. From the output stream, the recovery algorithm can reproduce the input stream; the recovery process itself is called decompression or unpacking, and only after decompression are the data suitable for processing in accordance with their internal format. Lossless compression is used for text, executable files, and high-quality audio and graphics.

Irreversible compression usually achieves a much higher compression ratio than lossless coding, but it allows some deviation of the decoded data from the original. In practice there is a wide range of problems in which exact restoration of the original information after decompression is not required. This applies in particular to the compression of multimedia information: sound, photographs, and video. For example, the JPEG and MPEG multimedia formats, which use irreversible compression, are widely used. Irreversible compression is usually not used together with cryptographic encryption, since the main requirement for a cryptosystem is that the decrypted data be identical to the original. However, when multimedia technologies are used, data in digital form is often compressed irreversibly before being passed to a cryptographic system for encryption. After the information has been delivered to the consumer and decrypted, the multimedia files are used in compressed form (that is, they are not restored).

Let's take a closer look at some of the most common methods of reversible data compression.

The best-known simple approach and algorithm for reversible compression is Run-Length Encoding (RLE). The essence of this approach is to replace strings, or runs, of repeated bytes with a single filler byte and a counter of the number of repetitions. The only difficulty for all such methods is how the unpacking algorithm can distinguish an encoded run in the resulting byte stream from other, unencoded byte sequences. The problem is usually solved by placing labels at the beginning of encoded runs; such labels can be characteristic bit values in the first byte of the encoded run or characteristic values of the first byte itself. The disadvantage of RLE is its rather low compression ratio and the overhead of encoding files with few runs or, even worse, runs that contain few repeated bytes.
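A minimal sketch of the RLE idea in Python (a toy count-value scheme, not the on-disk format of any particular archiver); as noted above, it pays off only when the runs are long:

```python
def rle_encode(data: bytes) -> bytes:
    """Store each run of equal bytes as a (count, value) pair, count <= 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Expand the (count, value) pairs back into the original byte sequence."""
    out = bytearray()
    for count, value in zip(encoded[0::2], encoded[1::2]):
        out += bytes([value]) * count
    return bytes(out)


sample = b"AAAAAABBBCD"
packed = rle_encode(sample)      # 8 bytes instead of 11
assert rle_decode(packed) == sample
```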

With uniform coding, every message is assigned the same number of bits, regardless of its probability of occurrence. At the same time, it is logical to expect that the total length of the transmitted messages will decrease if frequent messages are encoded with short codewords and rare ones with longer codewords. The problems that arise here are connected with the need to use variable-length codes. There are many approaches to constructing such codes.

Some of the most widely used methods in practice are the dictionary methods, whose main representatives are the algorithms of the Ziv-Lempel family. Their main idea is that fragments of the input stream ("phrases") are replaced by a pointer to the place where they have already appeared in the text. In the literature such algorithms are referred to as LZ compression algorithms.

This method quickly adapts to the structure of the text and can encode short function words, since they occur very often in it. New words and phrases can also be formed from parts of previously encountered words. Decoding the compressed text is straightforward: a pointer is simply replaced by the ready-made phrase from the dictionary to which it points. In practice, the LZ method achieves good compression, and an important property of it is the very fast operation of the decoder.
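One concrete member of the Ziv-Lempel family is LZW; below is a compact Python sketch of its dictionary-building encoder, given purely as an illustration of the dictionary idea. The decoder, which rebuilds the same dictionary on the fly, is omitted, and here the initial dictionary is built from the characters of the text itself, which the decoder would also need to know.

```python
def lzw_encode(text: str) -> list:
    """LZW: grow a dictionary of phrases on the fly and output dictionary indices."""
    # start with a dictionary of all single characters occurring in the text
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    phrase, output = "", []
    for ch in text:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate                       # keep extending the current phrase
        else:
            output.append(dictionary[phrase])        # emit the longest known phrase
            dictionary[candidate] = len(dictionary)  # register the new, longer phrase
            phrase = ch
    if phrase:
        output.append(dictionary[phrase])
    return output


print(lzw_encode("abababab"))   # [0, 1, 2, 4, 1] -- repeated phrases collapse into few indices
```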

Another approach to compressing information is the Huffman code, whose encoder and decoder have fairly simple hardware implementations. The idea of the algorithm is as follows: knowing the probabilities of occurrence of the characters in a message, one can describe a procedure for constructing variable-length codes consisting of an integer number of bits. More probable characters are assigned shorter codes, while rarer characters get longer ones. This reduces the average codeword length and increases compression efficiency. Huffman codes have the prefix property (no codeword is the beginning of another), which allows them to be decoded unambiguously despite their variable length.

The synthesis procedure for the classical Huffman code assumes the presence of a priori information about the statistical characteristics of the message source. In other words, the developer must be aware of the probabilities of occurrence of certain symbols from which messages are formed. Let's consider the synthesis of the Huffman code using a simple example.

Let the source alphabet consist of four symbols with probabilities p(S_1) = 0.2, p(S_2) = 0.15, p(S_3) = 0.55, and p(S_4) = 0.1. Let us sort the symbols in descending order of probability of occurrence and present them in the form of a table (Fig. 14.3, a).

The code synthesis procedure consists of three main stages. At the first stage the rows of the table are collapsed: the two rows corresponding to the symbols with the lowest probabilities of occurrence are replaced by a single row with their total probability, after which the table is reordered. The collapsing continues until only one row remains in the table, with a total probability equal to one (Fig. 14.3, b).


Fig. 14.3.

At the second stage, a code tree is built using a collapsed table (Fig. 14.4, a). The tree is built starting from the last column of the table.


Fig. 14.4.

The unit in the last column forms the root of the tree. In the example under consideration this unit is formed from the probabilities 0.55 and 0.45, depicted as two tree nodes connected to the root. The first of them corresponds to the symbol S_3 and therefore does not branch any further.

The second node, labeled with probability 0.45, connects to two nodes of the third level with probabilities 0.25 and 0.2. The probability 0.2 corresponds to the symbol S_1, while the probability 0.25 is in turn formed from the probability 0.15 of the symbol S_2 and the probability 0.1 of the symbol S_4.

The edges connecting the individual nodes of the code tree are labeled 0 and 1 (for example, the left edges with 0 and the right edges with 1). At the third, final stage a table is built that matches the source symbols with the corresponding Huffman codewords. These codewords are formed by reading off the digits that mark the edges along the path from the root of the tree to the corresponding symbol. For the example under consideration, the Huffman code takes the form shown in the table on the right (Fig. 14.4, b).
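A quick check of this example in Python; the code lengths follow from the tree structure described above, while the exact bit patterns depend on how the edges are labeled:

```python
import math

probs = {"S1": 0.2, "S2": 0.15, "S3": 0.55, "S4": 0.1}
# Depths in the tree described above: S3 sits right under the root (1 bit),
# S1 one level deeper (2 bits), S2 and S4 at the bottom (3 bits each).
lengths = {"S1": 2, "S2": 3, "S3": 1, "S4": 3}

average = sum(probs[s] * lengths[s] for s in probs)
entropy = -sum(p * math.log2(p) for p in probs.values())
print(f"average codeword length = {average:.2f} bits")   # 1.70
print(f"source entropy          = {entropy:.3f} bits")   # ~1.681
```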

However, the classical Huffman algorithm has one significant drawback. To recover the contents of the compressed message, the decoder must know the frequency table used by the encoder. Consequently, the length of the compressed message is increased by the length of the frequency table to be sent in front of the data, which can negate all the efforts to compress the message.

Another variant of static Huffman coding consists in scanning the input stream and building the encoding on the basis of the collected statistics. This requires two passes over the file: one to scan it and collect statistical information, and a second one to encode it. In static Huffman coding the input symbols (bit strings of various lengths) are mapped to bit strings of variable length, their codes. The code length of each symbol is taken proportional to the binary logarithm of its frequency, taken with the opposite sign, and the set of all distinct symbols encountered forms the alphabet of the stream.

There is another method: adaptive, or dynamic, Huffman coding. Its general principle is to change the coding scheme depending on the nature of the changes in the input stream. This approach works in a single pass and does not require storing information about the encoding used in explicit form. Adaptive coding can give a higher compression ratio than static coding, since changes in the frequencies of the input stream are taken into account more fully. With adaptive Huffman coding, the complexity of the algorithm lies in the need to constantly adjust the tree and the codes of the symbols of the basic alphabet in accordance with the changing statistics of the input stream.

Huffman methods offer fairly high speed and moderately good compression quality. However, Huffman coding has minimal redundancy only provided that each character is encoded by a separate codeword over the alphabet {0, 1}. The main drawback of this method is that the compression ratio depends on how close the symbol probabilities are to negative powers of two, which follows from the fact that each symbol is encoded with an integer number of bits.

A completely different solution is offered by arithmetic coding. This method is based on the idea of converting the input stream into a single floating-point number. Arithmetic coding is a technique that allows the characters of the input alphabet to be packed without loss, provided that the frequency distribution of those characters is known.

In arithmetic coding, the sequence of characters to be compressed is treated as a binary fraction from the interval [0, 1).
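Below is a minimal float-based Python sketch of this interval-narrowing idea. It is a toy: production arithmetic coders use integer arithmetic and renormalization to avoid the precision loss this version suffers on long messages, and the probabilities here are made up.

```python
def cumulative(probs: dict) -> dict:
    """Map each symbol to its half-open sub-interval of [0, 1)."""
    ranges, start = {}, 0.0
    for sym, p in probs.items():
        ranges[sym] = (start, start + p)
        start += p
    return ranges


def arithmetic_encode(message: str, probs: dict) -> float:
    """Toy encoder: narrow the interval [low, high) once per symbol."""
    ranges = cumulative(probs)
    low, high = 0.0, 1.0
    for sym in message:
        span = high - low
        s_low, s_high = ranges[sym]
        low, high = low + span * s_low, low + span * s_high
    return (low + high) / 2      # any number inside the final interval will do


def arithmetic_decode(code: float, probs: dict, length: int) -> str:
    """Toy decoder: find which sub-interval the number falls into, rescale, repeat."""
    ranges = cumulative(probs)
    out = []
    for _ in range(length):
        for sym, (s_low, s_high) in ranges.items():
            if s_low <= code < s_high:
                out.append(sym)
                code = (code - s_low) / (s_high - s_low)
                break
    return "".join(out)


probs = {"a": 0.6, "b": 0.3, "c": 0.1}            # made-up distribution
code = arithmetic_encode("abac", probs)
assert arithmetic_decode(code, probs, 4) == "abac"
```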
