lz77

LZ77 is a lossless compression algorithm, invented by Abraham Lempel and Jacob Ziv in 1977. The basic idea is that when redundancy in input data happens (i.e. same thing happening for a second or third or etc. times) one could refer back to it its previous occurence instead of repeating that same thing again. Normally we will have a hard limit on how far back we can look and refer because it wouldn't be very reasonable for us to allow "infinite" look-back. (This is where all the "sliding-window" talk comes from. We will also refer to this limit as "the maximum window size" or simply "the window size".)

From the basic idea mentioned above the compression output naturally would have two different kinds of commands: "literal"-type commands that ask the decompression algorithm to directly copies data to the output, and "copy"-type commands which ask the decompression algorithm to copy the substring N characters back starting at the last N character in the output (why output? Because the output of a decompression algorithm is the input the compression algorithm originally takes.)

The decompression process is thus as follows:

We shall use the ULZ format from Devine Lu Linvega as an example. In the format, there are one "literal" command and two "copy" command:

Other file formats that uses LZ77 would of course have different command formats than this one, for the obvious reason of the compression rate of the algorithm depends on the input, the window size, the kind of redundancy that you choose to eliminate, how you encode and represent the two kinds of commands and other aspects. Notice that in the format above the length field of CPY1 and CPY2 represents 4 at minimum; this means that when compressing, redundancy of length strictly smaller than 4 is ignored. This is because redundancies of length smaller than 4 don't make sense to be encoded as a CPY1 or CPY2 command since such action costs at least 2+1 bytes (the last 1 byte being the cost of starting another LIT command).

The code for decompression is thus as follows:

def decode(x):
    # cyclic buffer of length 256.
    window = [0 for _ in range(256)]
    window_i = 0
    x_len = len(x)
    # `i` is the index into the input.
    i = 0
    while i < x_len:
        # determining the type of the command)
        b = x[i]
        if not (b & 0b10000000):
            # it's a LIT command.
            b = b & 0b01111111
            # data starts from the next byte of input.
            i += 1
            for _ in range(b+1):
                # output the byte
                yield x[i]
                # push it back to the buffer
                window[window_i] = x[i]
                window_i = (window_i + 1) % 256
                i += 1
        elif (b & 0b10000000):
            # it's a CPY1 or CPY2 command. they both have a length and an
            # offset field, so we combine the handling of both here.
            if not (b & 0b01000000):
                i += 1
                l = b & 0b00111111
            else:
                i += 2
                l = ((b & 0b00111111) << 8) | x[i-1]
            l += 4
            off = x[i]
            # the copying.
            for _ in range(l):
                bi = (window_i - off - 1 + 256) % 256
                # output the byte
                yield window[bi]
                # push back into the buffer
                window[window_i] = window[bi]
                window_i = (window_i + 1) % 256
            # next command.
            i += 1

The pseudocode for LZ77 on the Wikipedia is absolutely horrible:

while input is not empty do
    match := longest repeated occurrence of input that begins in window

    if match exists then
        d := distance to start of match
        l := length of match
        c := char following match in input
    else
        d := 0
        l := 0
        c := first char of input
    end if

    output (d, l, c)

    discard l + 1 chars from front of window
    s := pop l + 1 chars from front of input
    append s to back of window
repeat

The hardest part of LZ77 is simply described as "longest repeated occurrence of input that begins in window" and "match exists". While they are technically not wrong, they don't provide any useful information about how to implement them either. I shall describe the compression process in much more details here.

The compression process goes as follows:

The time complexity amounts to O(N2*M), where N is the possible length of the input and M is the maximum window size. There is probably a better ("better", as in having a better time complexity) way to do this, but it's rarely worth the hassle.

The code for compression is thus as follows:

# NOTE THAT this is a slightly different version from the algorithm above,
# but the spirit is the same.
# we will do it in two phases: turning the input into commands, and turning
# the commands into byte sequences. of course, one could directly converts
# the generated command into byte sequences and thus complete the process
# in one single phase.
def encode_n(x):
    # init
    cnt = 0
    x_len = len(x)
    i = 0
    # first byte; this must be included in a LIT because there's no char to copy.
    cmd_list = []
    current_cmd = None
    while i < x_len:
        if current_cmd and current_cmd[1] >= 127:
            cmd_list.append(current_cmd)
            current_cmd = None
        if current_cmd is None:
            current_cmd = [0, 0, [x[i]]]
            cnt += 1
            i += 1
            continue
        # get first rep
        max_possible_off = min(cnt, 256)
        j = 1
        stp = i - j
        max_j = 0
        max_k = 0
        while j <= max_possible_off:
            stp = i - j
            this_match_len = 0
            k = 0
            while (
                stp+k < x_len
                and i+k < x_len
                and x[stp+k] == x[i+k]
            ):
                k += 1
            this_match_len = k
            if this_match_len > max_k:
                max_k = this_match_len
                max_j = j
            j += 1
        if max_k <= 3:
            # cannot find first rep; still LIT.
            # match of 1 characters are meaningless because to encode copy you need more than 1 byte
            # match of 2 or 3 characters are also meaningless because while such copy does not cost
            # more than LIT it costs a bit more time to decode.
            current_cmd[1] += 1
            current_cmd[2].append(x[i])
            cnt += 1
            i += 1
        elif max_k <= 0b111111:
            # CPY1
            cmd_list.append(current_cmd)
            current_cmd = None
            cmd_list.append([2, max_k-4, max_j-1])
            cnt += max_k
            i += max_k
        else:
            # CPY2
            cmd_list.append(current_cmd)
            current_cmd = None
            cmd_list.append([3, max_k-4, max_j-1])
            cnt += max_k
            i += max_k
    if current_cmd is not None:
        cmd_list.append(current_cmd)
    return cmd_list

def encode(x):
    cmd_list = encode_n(x)
    for cmd in cmd_list:
        if cmd[0] == 0:
            yield cmd[1] & 0b01111111
            yield from cmd[2]
        elif cmd[0] == 2:
            yield 0b10000000|(cmd[1]&0b111111)
            yield cmd[2] & 0b11111111
        elif cmd[0] == 3:
            yield 0b11000000|((cmd[1]&0b11111100000000)>>8)
            yield cmd[1]&0b11111111
            yield cmd[2]&0b11111111

2024.8.17

Back