lz77
LZ77 is a lossless compression algorithm, invented by Abraham Lempel and Jacob Ziv in 1977. The basic idea is that when redundancy in input data happens (i.e. same thing happening for a second or third or etc. times) one could refer back to it its previous occurence instead of repeating that same thing again. Normally we will have a hard limit on how far back we can look and refer because it wouldn't be very reasonable for us to allow "infinite" look-back. (This is where all the "sliding-window" talk comes from. We will also refer to this limit as "the maximum window size" or simply "the window size".)
From the basic idea mentioned above the compression output naturally would have two different kinds of commands: "literal"-type commands that ask the decompression algorithm to directly copies data to the output, and "copy"-type commands which ask the decompression algorithm to copy the substring N characters back starting at the last N character in the output (why output? Because the output of a decompression algorithm is the input the compression algorithm originally takes.)
The decompression process is thus as follows:
- We first have a cyclic buffer of size the maximum window size. It is (or it can be) cyclic because (1) we will only add things to the buffer (2) we are allowed to forget (or forced to forget) the
- For every encoded action of the input:
- If it is a "literal" action, (this kind of actions often, if not always, followed by a length and a string of data of that length). Directly copy this data to the output and to the buffer.
- If it is a "copy" action (this kind of actions often, if not always, followed by the "offset" of the look-back and the length of this copy). Starting from the provided offset, copy each character from the buffer to the output and also add it to the buffer.
We shall use the ULZ format from Devine Lu Linvega as an example. In the format, there are one "literal" command and two "copy" command:
LIT
: 3 bytes. max length 7 bits (1~128), followed by the bytes that should be directly copied to the output.CPY1
: 2 bytes. max length 6 bits (4~67), followed by 1-byte offset field (1~256).CPY2
: 3 bytes. max lenght 14 bits (4~16387), followed by 1-byte offset field (1~256).
Other file formats that uses LZ77 would of course have different command formats than this one, for the obvious reason of the compression rate of the algorithm depends on the input, the window size, the kind of redundancy that you choose to eliminate, how you encode and represent the two kinds of commands and other aspects. Notice that in the format above the length field of CPY1
and CPY2
represents 4 at minimum; this means that when compressing, redundancy of length strictly smaller than 4 is ignored. This is because redundancies of length smaller than 4 don't make sense to be encoded as a CPY1
or CPY2
command since such action costs at least 2+1 bytes (the last 1 byte being the cost of starting another LIT
command).
The code for decompression is thus as follows:
def decode(x): # cyclic buffer of length 256. window = [0 for _ in range(256)] window_i = 0 x_len = len(x) # `i` is the index into the input. i = 0 while i < x_len: # determining the type of the command) b = x[i] if not (b & 0b10000000): # it's a LIT command. b = b & 0b01111111 # data starts from the next byte of input. i += 1 for _ in range(b+1): # output the byte yield x[i] # push it back to the buffer window[window_i] = x[i] window_i = (window_i + 1) % 256 i += 1 elif (b & 0b10000000): # it's a CPY1 or CPY2 command. they both have a length and an # offset field, so we combine the handling of both here. if not (b & 0b01000000): i += 1 l = b & 0b00111111 else: i += 2 l = ((b & 0b00111111) << 8) | x[i-1] l += 4 off = x[i] # the copying. for _ in range(l): bi = (window_i - off - 1 + 256) % 256 # output the byte yield window[bi] # push back into the buffer window[window_i] = window[bi] window_i = (window_i + 1) % 256 # next command. i += 1
The pseudocode for LZ77 on the Wikipedia is absolutely horrible:
while input is not empty do match := longest repeated occurrence of input that begins in window if match exists then d := distance to start of match l := length of match c := char following match in input else d := 0 l := 0 c := first char of input end if output (d, l, c) discard l + 1 chars from front of window s := pop l + 1 chars from front of input append s to back of window repeat
The hardest part of LZ77 is simply described as "longest repeated occurrence of input that begins in window" and "match exists". While they are technically not wrong, they don't provide any useful information about how to implement them either. I shall describe the compression process in much more details here.
The compression process goes as follows:
- Have a variable CMD, recording the current command. We shall initialize it to (LIT, Input[0]), since there is obviously no redundancy for the first character in the input.
- (Index I) Starting from the first byte of input and we work forward; for every byte we read, we:
- Have a variable MAXJ and MAXK to record the offset and the length of the maximum redundancy we've found for this value of I.
(Index J) Starting from the first byte before that byte and we work backward; for every byte, we:
- Have a counter K;
- While
Input[I+K]
is the same asInput[J+K]
, increases K; - When the while loop above ends, the value of K would be the length of the redundancy we found. It could be zero, which means no redundancy with the given I and J; it could be of other value, which means redundancy exists.
- If K is larger than the current value of MAXK, record the current J and K in MAXJ and MAXK respectively.
Repeat, until J reaches the maximum window size (256 in our case). This is the "longest repeated occurrence of input that begins in window" part.
- After the loop above, we should have the maximum redundancy found in MAXJ and MAXK. For this format, we need to treat the cases where K is strictly smaller than 4 as if there is no redundancy. So:
- If MAXK is strictly smaller than 4, check if CMD is currently a LIT command.
- If yes, check if the command is full (which for this format is having a string of 128 bytes).
- If yes, output CMD and put (LIT, Input[I]) into CMD.
- If no, append Input[I] to the string of the command.
- If no, output CMD, and put (LIT, Input[I]) into CMD.
- If yes, check if the command is full (which for this format is having a string of 128 bytes).
- Or else, output (COPY, MAXJ, MAXK), and put it into CMD.
- If MAXK is strictly smaller than 4, check if CMD is currently a LIT command.
- After the loop above, check if CMD is a LIT command.
- If yes, output CMD. (This is because all LIT commands are kept and not immediately sent to output just in case we need to append more bytes while all COPY commands are sent to output immediately.)
- If no, do nothing.
The time complexity amounts to O(N2*M), where N is the possible length of the input and M is the maximum window size. There is probably a better ("better", as in having a better time complexity) way to do this, but it's rarely worth the hassle.
The code for compression is thus as follows:
# NOTE THAT this is a slightly different version from the algorithm above, # but the spirit is the same. # we will do it in two phases: turning the input into commands, and turning # the commands into byte sequences. of course, one could directly converts # the generated command into byte sequences and thus complete the process # in one single phase. def encode_n(x): # init cnt = 0 x_len = len(x) i = 0 # first byte; this must be included in a LIT because there's no char to copy. cmd_list = [] current_cmd = None while i < x_len: if current_cmd and current_cmd[1] >= 127: cmd_list.append(current_cmd) current_cmd = None if current_cmd is None: current_cmd = [0, 0, [x[i]]] cnt += 1 i += 1 continue # get first rep max_possible_off = min(cnt, 256) j = 1 stp = i - j max_j = 0 max_k = 0 while j <= max_possible_off: stp = i - j this_match_len = 0 k = 0 while ( stp+k < x_len and i+k < x_len and x[stp+k] == x[i+k] ): k += 1 this_match_len = k if this_match_len > max_k: max_k = this_match_len max_j = j j += 1 if max_k <= 3: # cannot find first rep; still LIT. # match of 1 characters are meaningless because to encode copy you need more than 1 byte # match of 2 or 3 characters are also meaningless because while such copy does not cost # more than LIT it costs a bit more time to decode. current_cmd[1] += 1 current_cmd[2].append(x[i]) cnt += 1 i += 1 elif max_k <= 0b111111: # CPY1 cmd_list.append(current_cmd) current_cmd = None cmd_list.append([2, max_k-4, max_j-1]) cnt += max_k i += max_k else: # CPY2 cmd_list.append(current_cmd) current_cmd = None cmd_list.append([3, max_k-4, max_j-1]) cnt += max_k i += max_k if current_cmd is not None: cmd_list.append(current_cmd) return cmd_list def encode(x): cmd_list = encode_n(x) for cmd in cmd_list: if cmd[0] == 0: yield cmd[1] & 0b01111111 yield from cmd[2] elif cmd[0] == 2: yield 0b10000000|(cmd[1]&0b111111) yield cmd[2] & 0b11111111 elif cmd[0] == 3: yield 0b11000000|((cmd[1]&0b11111100000000)>>8) yield cmd[1]&0b11111111 yield cmd[2]&0b11111111
2024.8.17