lzw

Lempel-Ziv-Welch, or LZW, is a variant of LZ78 by Terry Welch in 1984. Encoding-wise this is simpler than LZ78, since the compression algorithm simply outputs IDs instead of ID-character pair.

The compression algorithm is as follow:

Maintain a dictionary D.
Initialize D with all the possible value of a single datum and their corresponding ID.
Maintain a gensym G that generates IDs starting from the number of all possible values of a single datum. e.g. if a single-datum is a byte (which has 256 different values ranging from 0 to 255), then G should start from 256.
Maintain a variable P containing the "prefix". In the beginning, the prefix is empty.
Repeat the following as long as there are characters in the input stream:
- Let variable C pointing to the most current character in the input stream.
- Check if string P + C is in D.
  - If yes, then let P be P + C (adding C to the prefix)
  - If no:
    - Output the ID of P to the output stream according to D. If P is empty, then output 0 instead.
    - Add the string P + C to the dictionary with a new ID generated by G.
    - Let P be a string containing only a single occurence of C.
If P still contains characters, output the corresponding ID.

The decompression algorithm is as follow:

Maintain a dictionary D.
Initialize D with all the possible value of a single datum and their corresponding ID.
There must be a first code word in the compressed input. Easy to know that this code word is already in D. Output the corresponding string.
Maintain a variable PW; initialize it to the first code word. This variable would be referring to "the previous word" throughout.
Maintain a gensym G that generates IDs starting from the number of all possible values of a single datum. e.g. if a single-datum is a byte (which has 256 different values ranging from 0 to 255), then G should start from 256.
For each code word in the rest of the compressed input:
- Let CW be this code word;
- Check if CW is in D;
  - If the code word is in D;
    - Output D[CW];
    - Add the string D[PW] + C to D with a new ID from G, where C is the first character of D[CW];
  - Or else;
    - Output the string D[PW] + C, where C is the first character of D[PW];
    - Add the same string to D with a new ID from G;
- Let PW be of the value of CW.

One naive implementation in Python is thus as follows:

def encode(x):
    res = []
    D = {chr(i):i for i in range(0, 256)}
    G = 256
    P = ''
    for C in x:
        if P + C in D:
            P += C
        else:
            res.append(D[P])
            D[P+C] = G
            G += 1
            P = C
    if P:
        res.append(D[P])
    return res

def decode(x):
    res = ''
    D = {i:chr(i) for i in range(0, 256)}
    G = 256
    PW = x[0]
    res += D[PW]
    for CW in x[1:]:
        if CW in D:
            res += D[CW]
            D[G] = D[PW] + D[CW][0]
            G += 1
        else:
            s = D[PW] + D[PW][0]
            res += s
            D[G] = s
            G += 1
        PW = CW
    return res

Some people may doubt the efficiency of this algorithm; from a test I've done four years ago in 2020, when encoding data (with single datum of width 1 byte) into 12-bit code words it can reduce the size of what was relatively-less-redundancy text to around 43% of the original size.

2024.8.18

Back