Run-length encoding is a simple lossless compression technique that can be quite effective for text compression. The objective is to replace consecutive occurrences of a given symbol with only one copy of the symbol, plus a count of how many times that symbol occurs. Hence the name "run-length". For example, the string AAABBCDDDD would be encoded as 3A2B1C4D.
Run-length encoding efficiency depends on the number of repeated character occurrences in the data to be compressed and the average repeated character length. The standard measures of compression efficiency is the compression ratio, which is the ratio of the length of the uncompressed data to the compressed data.
Run-length encoding can be used to compress digital imagery by comparing adjacent pixel values and then encoding only the changes. For images, it is not uncommon that run-length encoding can achieve compression ratios on the order of 8-to-1 for scanned text images. Run-length encoding works well on such files because they often contain a large amount of which space that can be removed. In fact, runlength encoding is the key compression algorithm used to transmit faxes. However, for images with even a small degree of local variation, it is not uncommon for compression to actually increase the image byte size, since it takes 2 bytes to represent a single symbol when that symbol is not repeated.
Differential Pulse Code Modulation
In differential pulse code modulation technique, the first step is to output a reference symbol and then, for each symbol in the data, to output the difference between that symbol and the reference symbol. For example, using symbol A as the reference symbol, the string AAABBCDDDD would be encoded as A0001123333 since A is same as the reference symbol, B has a difference of 1 from the reference symbol, and so on. Note that this simple example does not illustrate the neal benefit of differential pulse code modulation. The real benefit is that when the differences are small, they can be encoded with fewer bits than the symbol itself. In this example, the range of differences 0-3 can be represented with 2 bits each rather than 7 or 8 bits required by the full character. As soon as the difference becomes too large, a new reference symbol is selected.
Differential pulse code modulation works better than run-length encoding for most digital imagery, since it takes advantage of the fact that adjacent pixels are usually similar due to this correction, the dynamic range of differences between the .adjacent pixel values can be significantly less than the dynamic range of original image, and this range can therefore be represented using fewer bits. Using differential pulse code modulation, we have measured compression ratios of 1.5-to-1 on digital images.
A slightly different approach, called delta encoding, simply encodes a symbol as the difference from previous one. Thus, for example, AAABBCDDDD would be represented as A001011000. Note that delta encoding is likely to work well for encoding images where adjacent pixels are similar. It is also possible to perform run-length encoding after delta encoding, since we might find long strings of O's if there are similar symbols next to each other.
Dictionary-based Methods
The last lossless compression method is the dictionary-based approach, of which the Lempal - Ziv (LZ) compression algorithm is the best known. The Unix compress command uses a variation of the LZ algorithm.
The purpose of a dictionary-based compression algorithm is to build a dictionary (table) of variable-length strings (think of them as common phrases) that you expect to find in the data, and then to replace each of these strings when it appears in the data with the corresponding index to the dictionary. For example, instead of working with individual characters in the text data, you could treat each word as a string and output the index in the dictionary for the word. To further elaborate on this example, the word "Compression" has the index 4978 in one particular dictionary; it is the 4978th word in usr/share/dict/words. To compress a body of text, each time the string "Compression"appears it would be replaced by 4978. Since this particular dictionary has just over 25,000 words in it, it would take 15 bits to encode the index, meaning that the string compression could be represented in 15 bits rather than 77 bits required by 7-bit ASCII. This is a compression ratio of 5-to-1.
Of course, this leaves the question of where the dictionary comes from. One option is to define a static dictionary, preferably one that is tailored for the data being compressed.
BACK |