lz77

LZ77 is a lossless compression algorithm, invented by Abraham Lempel and Jacob Ziv in 1977. The basic idea is that when redundancy in input data happens (i.e. same thing happening for a second or third or etc. times) one could refer back to it its previous occurence instead of repeating that same thing again. Normally we will have a hard limit on how far back we can look and refer because it wouldn't be very reasonable for us to allow "infinite" look-back. (This is where all the "sliding-window" talk comes from. We will also refer to this limit as "the maximum window size" or simply "the window size".)

From the basic idea mentioned above the compression output naturally would have two different kinds of commands: "literal"-type commands that ask the decompression algorithm to directly copies data to the output, and "copy"-type commands which ask the decompression algorithm to copy the substring N characters back starting at the last N character in the output (why output? Because the output of a decompression algorithm is the input the compression algorithm originally takes.)

The decompression process is thus as follows:

We first have a cyclic buffer of size the maximum window size. It is (or it can be) cyclic because (1) we will only add things to the buffer (2) we are allowed to forget (or forced to forget) the
For every encoded action of the input:
- If it is a "literal" action, (this kind of actions often, if not always, followed by a length and a string of data of that length). Directly copy this data to the output and to the buffer.
- If it is a "copy" action (this kind of actions often, if not always, followed by the "offset" of the look-back and the length of this copy). Starting from the provided offset, copy each character from the buffer to the output and also add it to the buffer.

We shall use the ULZ format from Devine Lu Linvega as an example. In the format, there are one "literal" command and two "copy" command:

LIT: 3 bytes. max length 7 bits (1~128), followed by the bytes that should be directly copied to the output.
CPY1: 2 bytes. max length 6 bits (4~67), followed by 1-byte offset field (1~256).
CPY2: 3 bytes. max lenght 14 bits (4~16387), followed by 1-byte offset field (1~256).

Other file formats that uses LZ77 would of course have different command formats than this one, for the obvious reason of the compression rate of the algorithm depends on the input, the window size, the kind of redundancy that you choose to eliminate, how you encode and represent the two kinds of commands and other aspects. Notice that in the format above the length field of CPY1 and CPY2 represents 4 at minimum; this means that when compressing, redundancy of length strictly smaller than 4 is ignored. This is because redundancies of length smaller than 4 don't make sense to be encoded as a CPY1 or CPY2 command since such action costs at least 2+1 bytes (the last 1 byte being the cost of starting another LIT command).

The code for decompression is thus as follows:

def decode(x):
    # cyclic buffer of length 256.
    window = [0 for _ in range(256)]
    window_i = 0
    x_len = len(x)
    # `i` is the index into the input.
    i = 0
    while i < x_len:
        # determining the type of the command)
        b = x[i]
        if not (b & 0b10000000):
            # it's a LIT command.
            b = b & 0b01111111
            # data starts from the next byte of input.
            i += 1
            for _ in range(b+1):
                # output the byte
                yield x[i]
                # push it back to the buffer
                window[window_i] = x[i]
                window_i = (window_i + 1) % 256
                i += 1
        elif (b & 0b10000000):
            # it's a CPY1 or CPY2 command. they both have a length and an
            # offset field, so we combine the handling of both here.
            if not (b & 0b01000000):
                i += 1
                l = b & 0b00111111
            else:
                i += 2
                l = ((b & 0b00111111) << 8) | x[i-1]
            l += 4
            off = x[i]
            # the copying.
            for _ in range(l):
                bi = (window_i - off - 1 + 256) % 256
                # output the byte
                yield window[bi]
                # push back into the buffer
                window[window_i] = window[bi]
                window_i = (window_i + 1) % 256
            # next command.
            i += 1

The pseudocode for LZ77 on the Wikipedia is absolutely horrible:

while input is not empty do
    match := longest repeated occurrence of input that begins in window

    if match exists then
        d := distance to start of match
        l := length of match
        c := char following match in input
    else
        d := 0
        l := 0
        c := first char of input
    end if

    output (d, l, c)

    discard l + 1 chars from front of window
    s := pop l + 1 chars from front of input
    append s to back of window
repeat

The hardest part of LZ77 is simply described as "longest repeated occurrence of input that begins in window" and "match exists". While they are technically not wrong, they don't provide any useful information about how to implement them either. I shall describe the compression process in much more details here.

The compression process goes as follows:

Have a variable CMD, recording the current command. We shall initialize it to (LIT, Input[0]), since there is obviously no redundancy for the first character in the input.
(Index I) Starting from the first byte of input and we work forward; for every byte we read, we:
- Have a variable MAX_J and MAX_K to record the offset and the length of the maximum redundancy we've found for this value of I.
- (Index J) Starting from the first byte before that byte and we work backward; for every byte, we:
  - Have a counter K;
  - While Input[I+K] is the same as Input[J+K], increases K;
  - When the while loop above ends, the value of K would be the length of the redundancy we found. It could be zero, which means no redundancy with the given I and J; it could be of other value, which means redundancy exists.
  - If K is larger than the current value of MAX_K, record the current J and K in MAX_J and MAX_K respectively.
  Repeat, until J reaches the maximum window size (256 in our case). This is the "longest repeated occurrence of input that begins in window" part.
- After the loop above, we should have the maximum redundancy found in MAX_J and MAX_K. For this format, we need to treat the cases where K is strictly smaller than 4 as if there is no redundancy. So:
  - If MAX_K is strictly smaller than 4, check if CMD is currently a LIT command.
    - If yes, check if the command is full (which for this format is having a string of 128 bytes).
      - If yes, output CMD and put (LIT, Input[I]) into CMD.
      - If no, append Input[I] to the string of the command.
    - If no, output CMD, and put (LIT, Input[I]) into CMD.
  - Or else, output (COPY, MAX_J, MAX_K), and put it into CMD.
After the loop above, check if CMD is a LIT command.
- If yes, output CMD. (This is because all LIT commands are kept and not immediately sent to output just in case we need to append more bytes while all COPY commands are sent to output immediately.)
- If no, do nothing.

The time complexity amounts to O(N²*M), where N is the possible length of the input and M is the maximum window size. There is probably a better ("better", as in having a better time complexity) way to do this, but it's rarely worth the hassle.

The code for compression is thus as follows:

# NOTE THAT this is a slightly different version from the algorithm above,
# but the spirit is the same.
# we will do it in two phases: turning the input into commands, and turning
# the commands into byte sequences. of course, one could directly converts
# the generated command into byte sequences and thus complete the process
# in one single phase.
def encode_n(x):
    # init
    cnt = 0
    x_len = len(x)
    i = 0
    # first byte; this must be included in a LIT because there's no char to copy.
    cmd_list = []
    current_cmd = None
    while i < x_len:
        if current_cmd and current_cmd[1] >= 127:
            cmd_list.append(current_cmd)
            current_cmd = None
        if current_cmd is None:
            current_cmd = [0, 0, [x[i]]]
            cnt += 1
            i += 1
            continue
        # get first rep
        max_possible_off = min(cnt, 256)
        j = 1
        stp = i - j
        max_j = 0
        max_k = 0
        while j <= max_possible_off:
            stp = i - j
            this_match_len = 0
            k = 0
            while (
                stp+k < x_len
                and i+k < x_len
                and x[stp+k] == x[i+k]
            ):
                k += 1
            this_match_len = k
            if this_match_len > max_k:
                max_k = this_match_len
                max_j = j
            j += 1
        if max_k <= 3:
            # cannot find first rep; still LIT.
            # match of 1 characters are meaningless because to encode copy you need more than 1 byte
            # match of 2 or 3 characters are also meaningless because while such copy does not cost
            # more than LIT it costs a bit more time to decode.
            current_cmd[1] += 1
            current_cmd[2].append(x[i])
            cnt += 1
            i += 1
        elif max_k <= 0b111111:
            # CPY1
            cmd_list.append(current_cmd)
            current_cmd = None
            cmd_list.append([2, max_k-4, max_j-1])
            cnt += max_k
            i += max_k
        else:
            # CPY2
            cmd_list.append(current_cmd)
            current_cmd = None
            cmd_list.append([3, max_k-4, max_j-1])
            cnt += max_k
            i += max_k
    if current_cmd is not None:
        cmd_list.append(current_cmd)
    return cmd_list

def encode(x):
    cmd_list = encode_n(x)
    for cmd in cmd_list:
        if cmd[0] == 0:
            yield cmd[1] & 0b01111111
            yield from cmd[2]
        elif cmd[0] == 2:
            yield 0b10000000|(cmd[1]&0b111111)
            yield cmd[2] & 0b11111111
        elif cmd[0] == 3:
            yield 0b11000000|((cmd[1]&0b11111100000000)>>8)
            yield cmd[1]&0b11111111
            yield cmd[2]&0b11111111

2024.8.17

Back