Extracting Bitmaps From Plan 9 Fonts

This has been on my todo list for almost a year. I wanted to do this because the gb/ font (noted "script fonts?" on the wiki page) is (1) actually a Simplified Chinese font that supports the GB2312 standard (probably the reason for the folder name gb) and (2) quite an old font. I've only seen it once, *ages* ago, used by a weird-ass text editor with a built-in special input method. I have absolutely zero idea how the font got into Plan 9 (if you know, please let me know).

I thought this was gonna take days because I'd need to painfully read through pure *nix-flavour (?) manual pages; turns out it can be done in one afternoon. Why the heck did I keep this on my todo list for this long..??.?

The font files can be found here. Here is a part of the bitmap I've extracted:

The format of Plan 9 font files

A "font" in Plan 9 is made up of many "subfonts" because:

A Font may contain too many characters to hold in memory simultaneously.

-- cachechars(2) from Plan 9 manual pages

A subfont file contains the bitmap data for a range of Unicode codepoints. Its file name specifies the range or the start of the range. For example, the first row of the bitmap above comes from the file named Song.4e00.16, which means the file contains the glyphs from U+4E00 onwards for the 16px Song font.
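To make the naming scheme concrete, here is a minimal sketch that splits such a file name into its parts. It assumes the `family.starthex.size` pattern seen in Song.4e00.16; other files may encode a full range instead, and `parse_subfont_name` is a made-up helper name.

```python
def parse_subfont_name(name):
    """Split a name like 'Song.4e00.16' into (family, first_codepoint, size)."""
    # Assumption: the last two dot-separated fields are a hex codepoint and a size.
    family, start_hex, size = name.rsplit('.', 2)
    return family, int(start_hex, 16), int(size)


family, first, size = parse_subfont_name('Song.4e00.16')
print(family, f'U+{first:04X}', size)
```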

A subfont file is made up of three parts:
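Judging from the parser at the end of this post, the three parts are the glyph image, a header of three 12-byte decimal fields (n, height, ascent), and n+1 six-byte character entries. The sketch below decodes the last two parts from made-up bytes. The name `parse_tail` and the sample values are hypothetical, and treating `left` as a signed byte is an assumption based on the Fontchar struct.

```python
import struct


def parse_tail(tail):
    """Decode the subfont header and character table that follow the image."""
    # Three decimal fields of 12 bytes each: n, height, ascent.
    n, height, ascent = (int(tail[i:i + 12].decode('ascii')) for i in (0, 12, 24))
    entries = []
    for k in range(n + 1):
        # x is 16-bit little-endian; top, bottom, width are unsigned bytes;
        # left is read as a signed byte (assumption).
        entries.append(struct.unpack_from('<HBBbB', tail, 36 + 6 * k))
    return n, height, ascent, entries


# Made-up data: one 16-pixel-wide glyph plus the closing entry.
sample = b'%11d %11d %11d ' % (1, 16, 13)
sample += bytes([0, 0, 2, 14, 0, 16]) + bytes([16, 0, 0, 0, 0, 0])
print(parse_tail(sample))
```

The extra entry at index n carries no glyph of its own; its x only marks where the last glyph ends.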

When displaying the bitmap for the character c at point p:
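As a rough sketch of that rule, assuming the behaviour described in cachechars(2): glyph c occupies columns x[c] up to x[c+1] and rows top to bottom of the image, it is drawn shifted right by `left`, and the next character starts `width` pixels further along. The helper names `glyph_box` and `place` are made up, and the tuples follow the (x, top, bottom, left, width) order used by the parser in this post.

```python
def glyph_box(info_list, c):
    """Return (x0, y0, x1, y1) of glyph c inside the subfont image."""
    x, top, bottom, left, width = info_list[c]
    next_x = info_list[c + 1][0]  # the following entry closes this glyph
    return x, top, next_x, bottom


def place(info_list, c, p):
    """Return (top-left of the cropped glyph on screen, origin of the next glyph)."""
    x, top, bottom, left, width = info_list[c]
    px, py = p
    return (px + left, py + top), (px + width, py)


# Made-up table: two glyphs plus the closing entry.
info = [(0, 2, 14, 0, 16), (16, 1, 15, 1, 16), (32, 0, 0, 0, 0)]
print(glyph_box(info, 1))
print(place(info, 1, (100, 50)))
```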

The image containing the glyphs uses the Plan 9 image format (see image(6)), which is laid out as follows:

The decompression of the blocks goes as follows.
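A minimal standalone sketch of the scheme, using the same bit layout as the parser at the end of this post: a first byte with the high bit set starts a literal run, otherwise the code word is a back-reference with a 5-bit length and a 10-bit offset. `decompress_block` is a made-up name, and the input is a hand-made stream rather than real font data.

```python
def decompress_block(data):
    """Expand the code words of one compressed block into raw pixel bytes."""
    out = bytearray()
    i = 0
    while i < len(data):
        first = data[i]
        if first & 0x80:
            # Literal run: low 7 bits hold (length - 1), then the bytes follow.
            length = (first & 0x7F) + 1
            out += data[i + 1:i + 1 + length]
            i += 1 + length
        else:
            # Back-reference: 5 bits of (length - 3), 10 bits of (offset - 1).
            length = (first >> 2) + 3
            offset = (((first & 0x03) << 8) | data[i + 1]) + 1
            for _ in range(length):
                out.append(out[-offset])  # byte-wise: source may overlap output
            i += 2
    return bytes(out)


# Hand-made stream: literal "ab", then copy 4 bytes starting 2 bytes back.
stream = bytes([0x81, ord('a'), ord('b'), 0x04, 0x01])
print(decompress_block(stream))  # b'ababab'
```

Copying one byte at a time matters here: the back-reference asks for 4 bytes but only 2 exist yet, so the copy has to read bytes it has just written.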

With all this knowledge we can finally write the code:

import sys

def main(filename):
    with open(filename, 'rb') as f:
        s = f.read()

    if s.startswith(b'compressed\n'):
        print('compressed.')
        s = s[len(b'compressed\n'):]
        compressed = True
    else:
        print('not compressed.')
        compressed = False

    header1 = s[:12*5].decode('utf-8')
    s = s[12*5:]
    print('chan', header1[:12])
    print('r.min.x', header1[12:24])
    print('r.min.y', header1[24:36])
    print('r.max.x', header1[36:48])
    print('r.max.y', header1[48:])

    rminy = int(header1[24:36].strip())
    rmaxx = int(header1[36:48].strip())
    rmaxy = int(header1[48:].strip())

    if compressed:
        miny = rminy
        code_word_list = []
        # for each block:
        while miny < rmaxy:
            raw_block_header = s[:2*12]; s = s[2*12:]
            maxy = int(raw_block_header[:12].decode('utf-8'))
            nb = int(raw_block_header[12:].decode('utf-8'))
            # extract code words from block
            raw_block_data = s[:nb]; s = s[nb:]
            i = 0
            raw_block_data_len = len(raw_block_data)
            while i < raw_block_data_len:
                if raw_block_data[i] >= 128:
                    this_word_len = raw_block_data[i] - 128+1
                    i += 1
                    code_word_list.append(raw_block_data[i:i+this_word_len])
                    i += this_word_len
                else:
                    a = (raw_block_data[i]&0b01111100)>>2
                    b = ((raw_block_data[i]&0b00000011)<<8)|(raw_block_data[i+1])
                    code_word_list.append((a, b))
                    i += 2
            miny = maxy
        # decompress from code words
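        res = bytearray()  # bytearray avoids re-copying the output on every append
        for code_word in code_word_list:
            if isinstance(code_word, bytes):
                # literal run: copy the bytes through unchanged
                res += code_word
            else:
                # back-reference: copy `lenx` bytes starting `off` bytes back,
                # one at a time because the source may overlap the output
                a, b = code_word
                lenx = a+3
                off = b+1
                for _ in range(lenx):
                    res.append(res[-off])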

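        # Each image row is padded to a whole number of bytes; drop the padding
        # bits so every PBM row holds exactly `rmaxx` pixels.
        # (Assumes a 1-bit image whose r.min is 0 0.)
        bytes_per_row = (rmaxx + 7) // 8
        with open(f'{filename}.pbm', 'w') as f:
            print(f'P1\n{rmaxx} {rmaxy}', file=f)
            for y in range(rmaxy):
                row = res[y*bytes_per_row:(y+1)*bytes_per_row]
                bits = ''.join(f'{k:08b}' for k in row)[:rmaxx]
                print(' '.join(bits), file=f)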

        n = int(s[:12].decode('utf-8'))
        height = int(s[12:24].decode('utf-8'))
        ascent = int(s[24:36].decode('utf-8'))
        s = s[36:]
        info_list = []
        for _ in range(n+1):
            raw_info = s[:6]
            x = raw_info[0]|(raw_info[1]<<8)
            top = raw_info[2]
            bottom = raw_info[3]
            # `left` is a signed byte in the Fontchar struct
            left = raw_info[4] - 256 if raw_info[4] >= 128 else raw_info[4]
            width = raw_info[5]
            info_list.append((x, top, bottom, left, width))
            s = s[6:]

        with open(f'{filename}.info.txt', 'w') as f:
            print(f'n={n}, height={height}, ascent={ascent}', file=f)
            for x, top, bottom, left, width in info_list:
                print(f'x={x}, top={top}, bottom={bottom}, left={left}, width={width}', file=f)


if __name__ == '__main__':
    main(sys.argv[1])

Maybe one day I'll learn how TrueType works and compile a .ttf out of this... There's surprisingly quite a lot of stuff to learn.