Design of a Python to Postscript compiler

NOTE: I was learning Postscript during the vacation for Chinese New Year. This was meant to be completed in 2 weeks but I heavily underestimated the workload as I always do (and I'd really like to focus on other things now). This blogpost mainly consists of short notes from the thought process. Maybe one day I'll work on this idea again and hopefully produce a working compiler.

Supported subset of Python

I can't possibly adapt the whole python into postscript (even if it's possible it'll be way too much work) so this subset will have to do.

-- this is originally taken from Python's documentation:
--     https://docs.python.org/3/library/ast.html#abstract-grammar

module Pyps
{
    mod = Module(stmt* body)
        -- normally this means it's entered thru a REPL. but i probably
        -- don't have the time to make a repl so... :|
        | Interactive(stmt* body)

    stmt = FunctionDef(identifier name, arguments args,
                stmt* body,
                expr? returns)

         | Return(expr? value)
         | Delete(expr* targets)
         | Assign(expr* targets, expr value)
         | AugAssign(expr target, operator op, expr value)
         | AnnAssign(expr target, expr annotation, expr? value)
         | If(expr test, stmt* body, stmt* orelse)
         | While(expr test, stmt* body, stmt* orelse)
         | For(expr target, expr iter, stmt* body, stmt* orelse)

         -- i'm not going to add any error handling stuff right now because i
         -- haven't quite figured out how things work in postscript yet

         | Pass
         | Break
         | Continue
    
    expr = BoolOp(boolop op, expr* values)
         | BinOp(expr left, operator op, expr right)
         | UnaryOp(unaryop op, expr operand)
         | IfExp(expr test, expr body, expr orelse)
         | Dict(expr* keys, expr* values)
         | ListComp(expr elt, comprehension* generators)
         | DictComp(expr elt, comprehension* generators)
         | Compare(expr left, cmpop* ops, expr* comparators)
         | Call(expr func, expr* args, keyword* keywords)
         -- NOTE: slice, when applied on arrays, will be considered as the
         -- equivalent to "making a new array & fill it with for-loop`, and
         -- is handled the same way.
         | Subscript(expr value, expr slice)
         | Name(identifier id)
         | List(expr* elts)
         | Tuple(expr* elts)
        
    boolop = And | Or

    operator = Add | Sub | Mult | MatMult | Div | Mod | Pow | LShift
                 | RShift | BitOr | BitXor | BitAnd | FloorDiv

    unaryop = Invert | Not | UAdd | USub

    cmpop = Eq | NotEq | Lt | LtE | Gt | GtE | Is | IsNot | In | NotIn
    
    comprehension = (expr target, expr iter, expr* ifs, int is_async)

    -- could have to extra work to support keyword args but i'll leave it
    -- out for now.
    arguments = (arg* args)

    arg = (identifier arg, expr? annotation)
}

Type inference

In order to compile python into postscript correctly a certain degree of type inference is needed (see below, "General compiling scheme").

General compiling scheme

Handling expressions

Normally we can translates simple expressions to postscript in a bottom-up manner. For example (lowest + highest) / 2 in python can be translated into lowest highest add 2 div in postscript. Other constructs (listed below) are not that easy.

Lists, tuples and dicts literals

List and dict comprehensions are discussed later.

Operations on lists & dicts

There are no direct equivalent of OOP objects & method calls in postscript and it definitely isn't worth it to implement OOP in postscript (we already have enough "oop" lang bullshit now please leave postscript alone); but whether we implement oop constructs or not we'll have to manage the operations ourselves.

Take list.append in python for an example. Arrays are fixed size in postscript so there's no list.append; to append something to an array we have to create a new one with a bigger size, but this will break reference, so python lists are translated into an explicit reference of an array implemented with a postscript dict.

Strings & byte-strings

Postscript only has what should probably be called byte-strings; a simple way to solve this issue is to translate all string literals into big-endian utf32 (e.g. "你好世界", which is U+4F60, U+597D, U+4E16 and U+754C, will be translated into (\000\000\117\140\000\000\131\175\000\000\116\026\000\000\165\114)) and multiply or divide the lengths & indices by 4 as we need.

Operations on strings & byte-strings

The translations of string operations and byte-string operations are the same as long as they don't invole lengths & indices. Possible implementation of a few operations are listed below.

Subscripts & slices of (byte-)strings, tuples & arrays

Like mentioned above, postscript only has byte-strings. If we do encode unicode strings with big-endian utf32, then String[Start:End] can be translated into String Start 4 mul End 4 mul Start 4 mul sub getinterval. The translation for byte-string & array subscripts & slices are the same because getinterval is polymorphic: A[Start:End] will simply become A Start End Start sub getinterval.

When there's a Step part, one will have to fallback to the old "append-loop" method, i.e. for any A[Start:End:Step] it's equivalent to this:

# for arrays.
# some variable representing the stack top.
_ = []
for i in range(Start, End, Step):
    _.append(A[i])

# for bytestrings.
_ = b''
for i in range(Start, End, Step):
    _ = _ + A[i]

which is then translated like a normal for loop with range (discussed later).

Calls

A call like Func(arg0, arg1, ..., argn) is simply translated into arg0 arg1 ... argn Func when Func is an identifier. When it's not an identifier (e.g. something like some_dict["some_key"](args); some people do like their table-driven method in python) chances are a packed array representing Func will be pushed onto the stack rather than actually being executed, so the general way is translating it into arg0 arg1 ... argn Func exec.

Handling assignment statements

When the lvalue is an identifier:

Variable = Expr

it can be simply translated into this:

/Variable
    % insert whatever the heck `Expr` translates into
def

So-called "augmenting assignment" is similar: A += B is simply A = A + B and can be translated as if it's A = A + B.

When the lvalue is a subscript, i.e. something like A[B] = Expr, the translation depends on the type of A: it can be translated into A B Expr put (if A is an array) or A B cvn Expr put (if A is a dict).

Handling function definitions

The way to create a "local scope" in postscript is to create a new dict (with dict) and push it onto/pop it from the dict stack (done by begin and end). e.g. the factorial function can be defined in postscript as:

% fac: [n] --- [n!]
/fac {
    % create a dict and use it as a scope
    1 dict begin

    % define `n` in said scope
    /n exch def

    % the main part
    n 1 le
        { 1 }
        { n n 1 sub fac mul }
    ifelse

    % exit the scope
    end
}

There's an extra detail needs to be taken care of:

This requires knowing all the variable names used within a function definition (which is actually trivial). Function definitions are thus translated into postscript like the one shown below:

def F(a0, a1, ..., an):
    do_things()
/F {
    % where "!n" is the number of identifiers directly
    % used as a lvalue inside the function (including the parameters)。
    !n dict begin

    % the parameter names are in reverse order because how things work in postscript...
    [/an ... /a1 /a0] {exch def} forall

    % do_things
    do_things

    end
} def

Handling if

Single clause if is trivial:

if A:
    Body
A { Body } if

The translations of if-else statement and if-else expression are exactly the same; the only difference is whether there'll be result on the stack after the execution:

Value1 if A else Value2

# or:
if A:
    Body1
else:
    Body2
A { Value1 } { Value2 } ifelse

% or:
A { Body1 } { Body2 } ifelse

if-elif-else is simply a series of stacking if-else, and is converted as such:

if A.1:
    Body1
elif A.2:
    Body2
elif A.3:
    Body3
else:
    Body4

# this is equivalent to:
if A.1:
    Body1
else:
    if A.2:
        Body2
    else:
        if A.3:
            Body3
        else:
            Body4
A.1 { Body1 }
{
    A.2 { Body2 }
    {
        A.3 { Body3 }
        {
            Body4
        } ifelse
    } ifelse
} ifelse

Handling while

Easy to notice simple while loops without continue like this:

while A:
    if A.1:
        break
    else:
        do_work()

can be translated into postscript like this:

{
    A not { exit } if
    
    A.1 { exit } { do_work } ifelse
} loop

When continue is involved it's a little bit trickier. See below.

Handling break and continue

break and continue often appears within the same while loop like this:

while A:
    if A.1:
        break
    elif A.2:
        continue
    else:
        do_work()

To summarize we have:

But in postscript there's no directly similar construct as continue; instead we have:

This can be demonstrated with the following example: exiting by exit is not captured by stopped, while exiting by stop, is.

{ { exit } loop } stopped =  % prints `false`
{ { stop } loop } stopped =  % prints `true`

Easy to see continue as a special kind of exit that the execution flow will go back into the loop again upon exit, and the action of "going back into the loop" can be done by wrapping another layer of loop. So for the python example above we can construct this:

{
    % while-condition not met. exit immediately.
    A not { exit } if

    % this whole `loop-stopped-if` block will exit the *outer* loop when
    % executing the program fragment inside lead to a `stop` being executed.
    {

        % this loop is necessary to execu
        {
            % while-condition not met. exit immediately.
            % note that this is the while-condition; we convert this:
            %     while A:
            %         ...
            % into this here:
            %     while True:
            %         if not A: break
            %         ...
            A not { stop } if

            % "if A.1: break"
            A.1
                { stop } 
                {
                    % "elif A.2: continue" and "else: do_work()"
                    A.2
                        { exit }
                        { do_work }
                    ifelse
                }
            if
        } loop
    } stopped
        % we exit the loop here because `stop` (which in terms corresponds
        % to a `break`) was executed. if `exit` is executed, it'll continue
        % to execute because there's the outer `loop`.
        { exit } if
} loop

Handling for

Here's a few thing that needs to be careful about python's for:

So the following python program works perfectly fine:

# oh yes the `_` is not a "special" wildcard identifier either - it's just
# a normal variable name like any other variable names.
total_sum = 0
for i, _ in enumerate(range(10)):
    pass

print(i, _)
# prints "9 9"

for with range

Note that postscript for is "inclusive", i.e. when the looping variable (the i part in for i in iter) is exactly equal to the end value, the body is also executed, so while you calculate Sum(i=1→100) in python like this:

total_sum = 0
for i in range(1, 101):
    total_sum += i

in postscript it's this:

/total_sum 0 def
% note that it's 100 not 101
1 1 100 { total_sum add /total_sum exch def } for

The actual End for postscript is a bit tricky to calculate; the easy way out is to once again using the equivalent while. So the program:

for i in range(Start, End, Step):
    Body

will be seen as:

i = Start
while i < End:
    Body
    i += Step

and thus results in:

/i Start def
{
    i End lt not { exit } if
    {
        i End lt not { stop } if
        Body
        /i i Step add def
    } stopped { exit } if
} loop

for with enumerate

for with enumerate is a different story:

So a for loop like this:

for Index, Value in enumerate(Iter):
    Body

can be translated into this:

% we also have to manage the index on our own.
/Index 0 def
{
    % if `Iter` is a dict then insert `pop` here.
    /Value exch def
    /Index Index 1 add def
} forall

for with other iterators

The translation of this kind of for is similar to the translation of for with enumerate.

Comprehensions

Comprehensions can be done in a similar way of for. The list comprehension form [Expr(Var) for Var in Iter if Cond(Var)] is equivalent to:

_ = []
for Var in Iter:
    if Cond(Var):
        _.append(Expr(Var))

List comprehension forms with multiple for-clause, e.g. [Expr(Var1, Var2) for Var1 in Iter2 if Cond1(Var1) for Var2 in Iter2 if Cond2(Var2)] is equivalent to:

_ = []
for Var1 in Iter1:
    if Cond1(Var1):
        for Var2 in Iter2:
            if Cond2(Var2):
                _.append(Expr(Var1, Var2))

Dict comprehensions are similar but one faces the same numeric key problem as translating dict subscript (mentioned above). {Key(i): Value(i) for i in Iter if Cond(i)} is equivalent to:

_ = {}
for i in Iter:
    if Cond(i):
        _[Key(i)] = Value(i)

and thus can be translated as such.