lzw

Lempel-Ziv-Welch, or LZW, is a variant of LZ78 by Terry Welch in 1984. Encoding-wise this is simpler than LZ78, since the compression algorithm simply outputs IDs instead of ID-character pair.

The compression algorithm is as follow:

The decompression algorithm is as follow:

One naive implementation in Python is thus as follows:

def encode(x):
    res = []
    D = {chr(i):i for i in range(0, 256)}
    G = 256
    P = ''
    for C in x:
        if P + C in D:
            P += C
        else:
            res.append(D[P])
            D[P+C] = G
            G += 1
            P = C
    if P:
        res.append(D[P])
    return res

def decode(x):
    res = ''
    D = {i:chr(i) for i in range(0, 256)}
    G = 256
    PW = x[0]
    res += D[PW]
    for CW in x[1:]:
        if CW in D:
            res += D[CW]
            D[G] = D[PW] + D[CW][0]
            G += 1
        else:
            s = D[PW] + D[PW][0]
            res += s
            D[G] = s
            G += 1
        PW = CW
    return res

Some people may doubt the efficiency of this algorithm; from a test I've done four years ago in 2020, when encoding data (with single datum of width 1 byte) into 12-bit code words it can reduce the size of what was relatively-less-redundancy text to around 43% of the original size.


2024.8.18

Back