lzw
Lempel-Ziv-Welch, or LZW, is a variant of LZ78 by Terry Welch in 1984. Encoding-wise this is simpler than LZ78, since the compression algorithm simply outputs IDs instead of ID-character pair.
The compression algorithm is as follow:
- Maintain a dictionary D.
- Initialize D with all the possible value of a single datum and their corresponding ID.
- Maintain a gensym G that generates IDs starting from the number of all possible values of a single datum. e.g. if a single-datum is a byte (which has 256 different values ranging from 0 to 255), then G should start from 256.
- Maintain a variable P containing the "prefix". In the beginning, the prefix is empty.
- Repeat the following as long as there are characters in the input stream:
- Let variable C pointing to the most current character in the input stream.
- Check if string P + C is in D.
- If yes, then let P be P + C (adding C to the prefix)
- If no:
- Output the ID of P to the output stream according to D. If P is empty, then output 0 instead.
- Add the string P + C to the dictionary with a new ID generated by G.
- Let P be a string containing only a single occurence of C.
- If P still contains characters, output the corresponding ID.
The decompression algorithm is as follow:
- Maintain a dictionary D.
- Initialize D with all the possible value of a single datum and their corresponding ID.
- There must be a first code word in the compressed input. Easy to know that this code word is already in D. Output the corresponding string.
- Maintain a variable PW; initialize it to the first code word. This variable would be referring to "the previous word" throughout.
- Maintain a gensym G that generates IDs starting from the number of all possible values of a single datum. e.g. if a single-datum is a byte (which has 256 different values ranging from 0 to 255), then G should start from 256.
- For each code word in the rest of the compressed input:
- Let CW be this code word;
- Check if CW is in D;
- If the code word is in D;
- Output D[CW];
- Add the string D[PW] + C to D with a new ID from G, where C is the first character of D[CW];
- Or else;
- Output the string D[PW] + C, where C is the first character of D[PW];
- Add the same string to D with a new ID from G;
- If the code word is in D;
- Let PW be of the value of CW.
One naive implementation in Python is thus as follows:
def encode(x): res = [] D = {chr(i):i for i in range(0, 256)} G = 256 P = '' for C in x: if P + C in D: P += C else: res.append(D[P]) D[P+C] = G G += 1 P = C if P: res.append(D[P]) return res def decode(x): res = '' D = {i:chr(i) for i in range(0, 256)} G = 256 PW = x[0] res += D[PW] for CW in x[1:]: if CW in D: res += D[CW] D[G] = D[PW] + D[CW][0] G += 1 else: s = D[PW] + D[PW][0] res += s D[G] = s G += 1 PW = CW return res
Some people may doubt the efficiency of this algorithm; from a test I've done four years ago in 2020, when encoding data (with single datum of width 1 byte) into 12-bit code words it can reduce the size of what was relatively-less-redundancy text to around 43% of the original size.
2024.8.18