Syntax extension tutorial
From Gallium
Where does the universe come from? Why are we here? Those questions cannot be easily answered. However, "How can I make a syntax extension using Camlp4?" is answerable!
Contents |
Aim
What we want to do here is to create a calculator, which is something like the all-time favorite of parsers, grammars, etc. We will do that by making a module that can understand (parse) and compute expressions, and which we will plug in Camlp4. Wait. How? Let's see!
Setting things up
A Camlp4 syntax extension comes as a bytecode module (.cmo) or a library (.cma) that Camlp4 dynamically loads. This module (or set of modules in the case of library) extends the syntax understood by Camlp4. Hence our calculator will be a syntax extension too!
A couple of things to consider
- how to name our syntax extension?
- what language(s) should we use? Basically, we have three choices: (see Using Camlp4 for more details)
- what syntax should we extend?
What we gonna do
- using Camlp4 3.09 conventions, we will call our module pa_calculator.
- for know, let's stick to the usual (original) syntax: we will use the camlp4of compiler. You will have plenty of time to learn the revised syntax later, if you wish to.
- since original and revised parsers already know how to parse expressions, we will start from an empty grammar.
What you gonna do
- go and create a file named "pa_calculator.ml"
- compile it:
Note the "-I +camlp4", which will be necessary as soon as we start using Camlp4 modules. Also note the "-pp camlp4of.opt", which tells ocamlc to pre-process our code with camlp4of.opt (the native version of camlp4of)ocamlc -I +camlp4 camlp4lib.cma -pp camlp4of.opt pa_calculator.ml
- execute it:
This loads the pa_calculator module inside Camlp4, and (currently) does nothing more. Anyway, we now have the basis for working with our calculator! So, as they say on this crappy TV show, "Next !"camlp4 -parser pa_calculator.cmo
Creating the syntax extension skeleton
Our syntax extension is meant to be simple, so we use the quick and non-extensible way.
Preamble
open Camlp4.PreCast
This should be the first line of pa_calculator. This creates everything we need in Camlp4:
- locations (module Loc)
- lexer (module Lexer)
- grammar (module Gram)
- and other things we do not really care for now :-)
Creating our main rule (entry point)
let expression = Gram.Entry.mk "expression"
This creates an entry named "expression". Note that the actual entry and its name are independent, but it seems good practice to use the same string (would it really make sense anyway if you created an "open_declaration" entry whose name is "class_method"? Don't think so).
Extending the grammar
The default grammar loaded by Camlp4.PreCast is contained in the module Gram.
EXTEND Gram
This statement tells Camlp4 we start to extend the "Gram" grammar.
GLOBAL: expression;
GLOBAL declares "expression" as being a non-local grammar entry, while other entries will be local only. Note that if you don't write a GLOBAL statement, all entries are considered global.
Expressing our main entry
expression: [ [ INT ] ];
Here we defined the "expression" entry with only one rule, which just accepts an INT token. We will add more rules in the next section.
Ending the grammar extension
END
We are done with extending the "Gram" grammar.
Calling the parser
let _ = Gram.parse_string expression (Loc.mk "<string>") (read_line ())
This is the entry point of our module. It just parses the string read from stdin as an expression. The Loc.mk function creates a location, using the string provided as its name (so here it is "<string>"). This location will be used to report errors; for example if you replace type " a", this is what you obtain:
$ camlp4 -parser pa_calculator.cmo
a
File "<string>", line 1, characters 3-4:
Parse error: illegal begin of expression
An error occurs because our syntax extension only accepts integers as valid expressions. So... let's extend our expression entry!
Extending the expression rule
Now that we have a working parser, we want to be able to parse something more than just integers. We'll start by parsing additions and subtractions.
Sum and difference
Adding rules
expression:
[ [ SELF; "+"; SELF
| SELF; "-"; SELF
| INT] ];
There are two new things here: first, inside the brackets, we have replaced the single rule [ [ INT ] ] by several rules separated by "|". Second, the keyword SELF that is a synonym for "the entry being defined"; this can be particularly useful when changing the entry name.
Getting a result
So far we can parse things such as "1 + 2 + 3", but that is of little use since we do not compute any result. Assuming we want only integers, we can return an integer result this way:
expression:
[ [ x = SELF; "+"; y = SELF -> x + y
| x = SELF; "-"; y = SELF -> x - y
| x = INT -> int_of_string x ] ];
Here we use the result of expression (x = SELF), and associate actions with rules (with "->"). This results in expression now returning an integer.
Displaying the result
This is now obvious: just add a call to "print_int"!
print_int
(Gram.parse_string expression
(Loc.mk "<string>") (read_line ()))
Product and quotient
Same steps with "*" and "/":
expression:
[ [ x = SELF; "+"; y = SELF -> x + y
| x = SELF; "-"; y = SELF -> x - y
| x = SELF; "*"; y = SELF -> x * y
| x = SELF; "/"; y = SELF -> x / y
| x = INT -> int_of_string x ] ];
If you're familiar with parsers, you have already seen what is wrong here :) Otherwise, compile, execute, and try something like "1 + 2 * 3".
While the result should be 7, it is 9. As Camlp4 has no way of telling which rule to give priority to, it just takes the first one. So "1 + 2 * 3" is taken as "(1 + 2) * 3". Let's see how we can overcome that.
Managing priorities
Since times (*) and slash (/) have priority over plus (+) and minus (-), we need to explain that to Camlp4. This is done using levels.
Adding levels
expression:
[ [ x = SELF; "+"; y = SELF -> x + y
| x = SELF; "-"; y = SELF -> x - y ]
| [ x = SELF; "*"; y = SELF -> x * y
| x = SELF; "/"; y = SELF -> x / y ]
| [ x = INT -> int_of_string x ] ];
We have splitted the entry in 3 levels: the first one for additions/subtractions, the second one for multiplications/divisions, and the third one for integers. When encountering "1 + 2 * 3", the parser will first try to see it as a sum of two expressions, "1" and "2 * 3", and will output the correct result.
If you invert the first and second levels, you get the same behaviour as in the previous section.
Names
We can name levels:
expression:
[ "add"
[ x = SELF; "+"; y = SELF -> x + y
| x = SELF; "-"; y = SELF -> x - y ]
| "mul"
[ x = SELF; "*"; y = SELF -> x * y
| x = SELF; "/"; y = SELF -> x / y ]
| "simple"
[ x = INT -> int_of_string x ] ];
Associativity
Additionally, we can specify the associativity of levels, the default being LEFTA, which is the most natural. Associativity defines how expressions such as "1 + 2 + 3" should be understood:
- using left associativity, this is equivalent to "(1 + 2) + 3"
- using right associativity, this is equivalent to "1 + (2 + 3)".
For such things as addition, or multiplication, this does not matter. For substraction and division they are left associatives so the default is the good one. Let's define a new operator which is right associative: the pow operator, which we define as "**":
expression:
[ "pow" RIGHTA
[ x = SELF; "**"; y = SELF -> int_of_float ((float x) ** (float y)) ]
| "add" LEFTA
[ x = SELF; "+"; y = SELF -> x + y
| x = SELF; "-"; y = SELF -> x - y ]
| "mul" LEFTA
[ x = SELF; "*"; y = SELF -> x * y
| x = SELF; "/"; y = SELF -> x / y ]
| "simple" NONA
[ x = INT -> int_of_string x ] ];
The RIGHTA is important since if you try to compute something like 223 (28 = 256), you write "2 ** 2 ** 3", and you obtain 256 with RIGHTA and 64 with LEFTA.
And then?
By now you should be familiar with the basic concepts of syntax extension. If you want to know more about it, you may consult the Syntax extension page.
If you want to generate OCaml code, you'd probably like to check OCaml code generation tutorial.