lz78

LZ78 is a lossless compression algorithm by Abraham Lempel and Jacob Ziv in 1978. The compression algorithm goes as follows:

Maintain a dictionary D, containing found patterns & their IDs.
Maintain a gensym G that generates IDs starting from 1.
Maintain a variable P containing the "prefix". In the beginning, the prefix is empty.
Repeat the following as long as there are characters in the input stream:
- Let variable C pointing to the most current character in the input stream.
- Check if string P + C is in D.
  - If yes, then let P be P + C (adding C to the prefix)
  - If no:
    - Output the ID of P to the output stream according to D. If P is empty, then output 0 instead.
    - Output C as well.
    - Add the string P + C to the dictionary with a new ID generated by G.
    - Let P be empty.
If P still contains characters, output the corresponding ID.

At the end of the algorithm, the output should be a list of ID-character pair. The decompression algorithm is thus as follows:

Maintain a directory D, containing IDs and their corresponding patterns.
Maintain a gensym G that starts from 1.
For each ID-character pair in the input:
- If the ID is 0, output the character.
- Or else, look up the ID in D, output the corresponding string, and then output the character.
- Add the string P+C to D with a new ID from G, where P is the string the ID represents in D (or empty string, if the ID is 0) and C is the character.

A simple implementation in Python is thus as follows:

def encode(x):
    res = []
    d = {'': 0}
    gensym = 1
    p = ''
    for ch in x:
        if p+ch in d:
            p += ch
        else:
            res.append((d[p], ch))
            d[p+ch] = gensym
            gensym += 1
            p = ''
    if p:
        res.append((d[p], ''))
    return res

def decode(x):
    res = ''
    d = {0:''}
    gensym = 1
    for c, ch in x:
        res += d[c]+ch
        d[gensym] = d[c]+ch
        gensym += 1
    return res

On the Wikipedia page LZ78 doesn't even get a pseudocode part. Some people are truly over-estimating their ability of explaining things properly.

A few observations can be made:

Intuitively, for all the repeating pattern that occurs in the original input, a prefix of the pattern occurs just as often as the pattern itself, if not more often and more deserving of its own encoding.
There is at most N ID-character pairs where the ID is 0, (N is the number of distinct character in the input,) since every new character is introduced once and added to the dictionary, which makes them only referred as dictionary ID after.
One would naturally need to support a dictionary of a capacity that's bigger than the number of possibilities of datum, e.g. it would be better to support a dictionary of more than 256 entries if one datum is one byte.

A simpler variant named LZW exists; it's simpler encoding-wise and much more popular.

2024.8.17

Back