Design of a Python to Postscript compiler
NOTE: I was learning Postscript during the vacation for Chinese New Year. This was meant to be completed in 2 weeks but I heavily underestimated the workload as I always do (and I'd really like to focus on other things now). This blogpost mainly consists of short notes from the thought process. Maybe one day I'll work on this idea again and hopefully produce a working compiler.
Supported subset of Python
I can't possibly adapt the whole python into postscript (even if it's possible it'll be way too much work) so this subset will have to do.
-- this is originally taken from Python's documentation: -- https://docs.python.org/3/library/ast.html#abstract-grammar module Pyps { mod = Module(stmt* body) -- normally this means it's entered thru a REPL. but i probably -- don't have the time to make a repl so... :| | Interactive(stmt* body) stmt = FunctionDef(identifier name, arguments args, stmt* body, expr? returns) | Return(expr? value) | Delete(expr* targets) | Assign(expr* targets, expr value) | AugAssign(expr target, operator op, expr value) | AnnAssign(expr target, expr annotation, expr? value) | If(expr test, stmt* body, stmt* orelse) | While(expr test, stmt* body, stmt* orelse) | For(expr target, expr iter, stmt* body, stmt* orelse) -- i'm not going to add any error handling stuff right now because i -- haven't quite figured out how things work in postscript yet | Pass | Break | Continue expr = BoolOp(boolop op, expr* values) | BinOp(expr left, operator op, expr right) | UnaryOp(unaryop op, expr operand) | IfExp(expr test, expr body, expr orelse) | Dict(expr* keys, expr* values) | ListComp(expr elt, comprehension* generators) | DictComp(expr elt, comprehension* generators) | Compare(expr left, cmpop* ops, expr* comparators) | Call(expr func, expr* args, keyword* keywords) -- NOTE: slice, when applied on arrays, will be considered as the -- equivalent to "making a new array & fill it with for-loop`, and -- is handled the same way. | Subscript(expr value, expr slice) | Name(identifier id) | List(expr* elts) | Tuple(expr* elts) boolop = And | Or operator = Add | Sub | Mult | MatMult | Div | Mod | Pow | LShift | RShift | BitOr | BitXor | BitAnd | FloorDiv unaryop = Invert | Not | UAdd | USub cmpop = Eq | NotEq | Lt | LtE | Gt | GtE | Is | IsNot | In | NotIn comprehension = (expr target, expr iter, expr* ifs, int is_async) -- could have to extra work to support keyword args but i'll leave it -- out for now. arguments = (arg* args) arg = (identifier arg, expr? annotation) }
Type inference
In order to compile python into postscript correctly a certain degree of type inference is needed (see below, "General compiling scheme").
General compiling scheme
Handling expressions
Normally we can translates simple expressions to postscript in a bottom-up manner. For example (lowest + highest) / 2
in python can be translated into lowest highest add 2 div
in postscript. Other constructs (listed below) are not that easy.
Lists, tuples and dicts literals
- Python lists can be translated into a postscript dict with a single
value
field. The reason for this is explained below, in the "Operations on lists & dicts" section.- e.g.
[3, 4, 5]
will become<< /value [ 3 4 5 ] cvlit >>
.
- e.g.
- Python tuples can be translated into postscript
packedarray
s, thencvlit
andreadonly
is added.- e.g.
(3, "abc")
will become{ 3 (abc) } cvlit readonly
.
- e.g.
- Python dicts can be translated into postscript
dict
s with<<
>>
and name objects.- e.g.
{"abc": 3, "def": 4}
will become<< /abc 3 /def 4 >> cvlit
. - note that in python you can use expressions as keys in dict literals, e.g. executing
a = "abc"; b = {a: 3}
will assignb
with{"abc": 3}
. this is done in postscript withcvn
. for this example, the translation result for the{a: 3}
part will be<< a cvn 3 >> cvlit
.
- e.g.
List and dict comprehensions are discussed later.
Operations on lists & dicts
There are no direct equivalent of OOP objects & method calls in postscript and it definitely isn't worth it to implement OOP in postscript (we already have enough "oop" lang bullshit now please leave postscript alone); but whether we implement oop constructs or not we'll have to manage the operations ourselves.
Take list.append
in python for an example. Arrays are fixed size in postscript so there's no list.append
; to append something to an array we have to create a new one with a bigger size, but this will break reference, so python lists are translated into an explicit reference of an array implemented with a postscript dict.
- for lists:
A1 + A2
: note that in python adding two lists creates new list, so this is okay:<< /value A1 /value get aload pop A2 /value get aload pop A1 /value get length A2 /value get length add array astore >>
A.append(B)
:% setup for later `put` A % unpack the original array A /value get aload % create a new array dup length 1 add array % delete the original array exch pop % creates the new array B exch astore % put it back into the reference /value exch put
A.pop()
:A A /value get aload dup length 1 sub array exch pop exch pop astore /value exch put
- for dicts:
A[B]
andA.get(B)
:The difference between
A[B]
andA.get(B)
in Python is that the former throws an error whenB
does not exist inA
while the latter simply returnsNone
. This issue is easy to solve: the latter is essentially equivalent toNone if B not in A else A[B]
and can be translated as such. No, the real issue is in generating the code for the keys - specifically numeric keys. In postscript one cannot directlycvn
a number, in order to turn it into a name object one must first turn it into a string; but postscript strings are fixed-length & have to be pre-allocated, so one should preferrably knows how long the string representation of the said number would be beforehand (which when it's dynamic it's impossible) - or one can just allocate one big long string as the buffer because postscriptcvs
truncates the string for you.In the simple case of string keys, the former can simply be translated into
A B cvn get
because postscriptget
do throw an error when the key does not exist in dict. For the latter, as mentioned before, can be wrapped in anifelse
with aknown
test:A B cvn known { null } { A B cvn get } ifelse
A.keys()
:Postscript
forall
puts the key then the value for every key-value pair in dict. If these two objects are not discarded from the stack, they'll remain throughout theforall
; we can thus keep only the key objects and put them into an array:A { pop } forall A length array astore
A.items()
:Same as above with a different
forall
proc - instead of discarding the value object we pack the key and the value into a 2-array:A { 2 array astore } forall A length array astore
A.pop(B)
:A B cvn undef
Strings & byte-strings
Postscript only has what should probably be called byte-strings; a simple way to solve this issue is to translate all string literals into big-endian utf32 (e.g. "你好世界"
, which is U+4F60
, U+597D
, U+4E16
and U+754C
, will be translated into (\000\000\117\140\000\000\131\175\000\000\116\026\000\000\165\114)
) and multiply or divide the lengths & indices by 4 as we need.
Operations on strings & byte-strings
The translations of string operations and byte-string operations are the same as long as they don't invole lengths & indices. Possible implementation of a few operations are listed below.
A1 + A2
:A1 length A2 length add string dup 0 A1 putinterval dup A1 length A2 putinterval
A.join(B)
:there are probably a lot of solutions more simple than this mess.
B length 0 eq { () } { B length 1 eq { B 0 get } { B 0 get 1 1 B length 1 sub { 2 dict begin /i exch def /prev exch def prev length A length add B i get length add string dup 0 prev putinterval dup prev length A putinterval dup prev length A length add B i get putinterval end } for } ifelse } ifelse
len(A)
:A length % if A is infered to be a string (rather than byte-string), `4 div` is inserted after.
Subscripts & slices of (byte-)strings, tuples & arrays
Like mentioned above, postscript only has byte-strings. If we do encode unicode strings with big-endian utf32, then String[Start:End]
can be translated into String Start 4 mul End 4 mul Start 4 mul sub getinterval
. The translation for byte-string & array subscripts & slices are the same because getinterval
is polymorphic: A[Start:End]
will simply become A Start End Start sub getinterval
.
When there's a Step
part, one will have to fallback to the old "append-loop" method, i.e. for any A[Start:End:Step]
it's equivalent to this:
# for arrays. # some variable representing the stack top. _ = [] for i in range(Start, End, Step): _.append(A[i]) # for bytestrings. _ = b'' for i in range(Start, End, Step): _ = _ + A[i]
which is then translated like a normal for
loop with range
(discussed later).
Calls
A call like Func(arg0, arg1, ..., argn)
is simply translated into arg0 arg1 ... argn Func
when Func
is an identifier. When it's not an identifier (e.g. something like some_dict["some_key"](args)
; some people do like their table-driven method in python) chances are a packed array representing Func
will be pushed onto the stack rather than actually being executed, so the general way is translating it into arg0 arg1 ... argn Func exec
.
Handling assignment statements
When the lvalue is an identifier:
Variable = Expr
it can be simply translated into this:
/Variable % insert whatever the heck `Expr` translates into def
So-called "augmenting assignment" is similar: A += B
is simply A = A + B
and can be translated as if it's A = A + B
.
When the lvalue is a subscript, i.e. something like A[B] = Expr
, the translation depends on the type of A
: it can be translated into A B Expr put
(if A
is an array) or A B cvn Expr put
(if A
is a dict).
Handling function definitions
The way to create a "local scope" in postscript is to create a new dict (with dict
) and push it onto/pop it from the dict stack (done by begin
and end
). e.g. the factorial function can be defined in postscript as:
% fac: [n] --- [n!] /fac { % create a dict and use it as a scope 1 dict begin % define `n` in said scope /n exch def % the main part n 1 le { 1 } { n n 1 sub fac mul } ifelse % exit the scope end }
There's an extra detail needs to be taken care of:
- In postscript the size of dict when created is fixed. (note that
dict
takes a "size" parameter) - In postscript languagelevel 1, when this size is exceeded, a
dictfull
error will occur. Even though ghostscript (the current go-to implementation for postscript) supports up to languagelevel 3, I decide it's better to be aware of this.
This requires knowing all the variable names used within a function definition (which is actually trivial). Function definitions are thus translated into postscript like the one shown below:
def F(a0, a1, ..., an): do_things()
/F { % where "!n" is the number of identifiers directly % used as a lvalue inside the function (including the parameters)。 !n dict begin % the parameter names are in reverse order because how things work in postscript... [/an ... /a1 /a0] {exch def} forall % do_things do_things end } def
Handling if
Single clause if
is trivial:
if A: Body
A { Body } if
The translations of if-else
statement and if-else
expression are exactly the same; the only difference is whether there'll be result on the stack after the execution:
Value1 if A else Value2 # or: if A: Body1 else: Body2
A { Value1 } { Value2 } ifelse % or: A { Body1 } { Body2 } ifelse
if-elif-else
is simply a series of stacking if-else
, and is converted as such:
if A.1: Body1 elif A.2: Body2 elif A.3: Body3 else: Body4 # this is equivalent to: if A.1: Body1 else: if A.2: Body2 else: if A.3: Body3 else: Body4
A.1 { Body1 } { A.2 { Body2 } { A.3 { Body3 } { Body4 } ifelse } ifelse } ifelse
Handling while
Easy to notice simple while
loops without continue
like this:
while A: if A.1: break else: do_work()
can be translated into postscript like this:
{ A not { exit } if A.1 { exit } { do_work } ifelse } loop
When continue
is involved it's a little bit trickier. See below.
Handling break
and continue
break
and continue
often appears within the same while
loop like this:
while A: if A.1: break elif A.2: continue else: do_work()
To summarize we have:
- loop (done by
while
andfor
) - a way to exit the loop completely (done by
break
) - a way to prematurely proceed to the next iteration (done by
continue
)
But in postscript there's no directly similar construct as continue
; instead we have:
- loop (done by
for
,forall
,loop
and other looping constructs.) - a way to exit the loop (done by
exit
) - another way to exit the loop while kind of signifying an error like raising an exception (done by
stop
) - a way to detect such exit (done by
stopped
)
This can be demonstrated with the following example: exiting by exit
is not captured by stopped
, while exiting by stop
, is.
{ { exit } loop } stopped = % prints `false` { { stop } loop } stopped = % prints `true`
Easy to see continue
as a special kind of exit that the execution flow will go back into the loop again upon exit, and the action of "going back into the loop" can be done by wrapping another layer of loop
. So for the python example above we can construct this:
{ % while-condition not met. exit immediately. A not { exit } if % this whole `loop-stopped-if` block will exit the *outer* loop when % executing the program fragment inside lead to a `stop` being executed. { % this loop is necessary to execu { % while-condition not met. exit immediately. % note that this is the while-condition; we convert this: % while A: % ... % into this here: % while True: % if not A: break % ... A not { stop } if % "if A.1: break" A.1 { stop } { % "elif A.2: continue" and "else: do_work()" A.2 { exit } { do_work } ifelse } if } loop } stopped % we exit the loop here because `stop` (which in terms corresponds % to a `break`) was executed. if `exit` is executed, it'll continue % to execute because there's the outer `loop`. { exit } if } loop
Handling for
Here's a few thing that needs to be careful about python's for
:
- there's binding involved. e.g.
for i in iter
binds value toi
. - the binding is not local to the
for
block.
So the following python program works perfectly fine:
# oh yes the `_` is not a "special" wildcard identifier either - it's just # a normal variable name like any other variable names. total_sum = 0 for i, _ in enumerate(range(10)): pass print(i, _) # prints "9 9"
for
with range
Note that postscript for
is "inclusive", i.e. when the looping variable (the i
part in for i in iter
) is exactly equal to the end value, the body is also executed, so while you calculate Sum(i=1→100)
in python like this:
total_sum = 0 for i in range(1, 101): total_sum += i
in postscript it's this:
/total_sum 0 def % note that it's 100 not 101 1 1 100 { total_sum add /total_sum exch def } for
The actual End
for postscript is a bit tricky to calculate; the easy way out is to once again using the equivalent while
. So the program:
for i in range(Start, End, Step): Body
will be seen as:
i = Start while i < End: Body i += Step
and thus results in:
/i Start def { i End lt not { exit } if { i End lt not { stop } if Body /i i Step add def } stopped { exit } if } loop
for
with enumerate
for
with enumerate
is a different story:
- Unlike
range
, this time we have no way to manually determine how to loop, which means we'll have to useforall
. - In python, an iterator of a dict is an iterator of that dict's keys, which means we'll have to (1) insert a
pop
in the front of the generated loop body whenever the iterator is a dict, or (2) generate the array of keys before hand, and in order to do it correctly in compile-time both method needs some form of type inference.- which means (2) is extra work no matter what, so we'll ditch that solution.
So a for
loop like this:
for Index, Value in enumerate(Iter): Body
can be translated into this:
% we also have to manage the index on our own. /Index 0 def { % if `Iter` is a dict then insert `pop` here. /Value exch def /Index Index 1 add def } forall
for
with other iterators
The translation of this kind of for
is similar to the translation of for
with enumerate
.
Comprehensions
Comprehensions can be done in a similar way of for
. The list comprehension form [Expr(Var) for Var in Iter if Cond(Var)]
is equivalent to:
_ = [] for Var in Iter: if Cond(Var): _.append(Expr(Var))
List comprehension forms with multiple for
-clause, e.g. [Expr(Var1, Var2) for Var1 in Iter2 if Cond1(Var1) for Var2 in Iter2 if Cond2(Var2)]
is equivalent to:
_ = [] for Var1 in Iter1: if Cond1(Var1): for Var2 in Iter2: if Cond2(Var2): _.append(Expr(Var1, Var2))
Dict comprehensions are similar but one faces the same numeric key problem as translating dict subscript (mentioned above). {Key(i): Value(i) for i in Iter if Cond(i)}
is equivalent to:
_ = {} for i in Iter: if Cond(i): _[Key(i)] = Value(i)
and thus can be translated as such.