Full parser tutorial
From Gallium
You can use Camlp4 as a complete parser (not a syntax extension). This means you will not start from some variant of OCaml but rather parse a completely new language. The backend will still be OCaml code and the OCaml compiler though. This can be used to quickly build a DSL compiler on top of ocamlc.
As an example, we will make a compiler for a very simple language consisting of strings, variables and print statements:
def name = "world" print "hello", name
This program will print the message "hello world".
Contents |
Preliminaries
AST
First, we define an AST for this language:
type 'loc expr = Var of string * 'loc | String of string * 'loc ;; type 'loc stmt = Def of string * 'loc expr * 'loc | Print of 'loc expr list * 'loc ;;
An expression is either a variable reference or a string; a statement is either a variable definition or a print of some expressions. We want to associate a source location with every AST node; however we want to abstract over the concrete type of location that is used. We therefore declare loc
as a type parameter for the type constructors expr
and stmt
.
Translation
We then define how such an AST is translated into an OCaml AST. The quotations we will use for this implicitly expect an Ast
module with OCaml AST types. For now, we will work with the version inside the PreCast
module:
open Camlp4.PreCast ;;
The translation is very straightforward. We translate an expression into an OCaml expression, and we translate a statement into an OCaml structure item; we also define an auxiliary function to produce a string expression that concatenates a list of string expressions, interspersed with spaces:
let expr_of_expr = function | Var (v,_loc) -> <:expr< $lid:v$ >> | String (s,_loc) -> <:expr< $str:s$ >> ;;
let concat_exprs _loc = function | [] -> failwith "concat_exprs" | e::es -> List.fold_left (fun e e' -> <:expr< $e$ ^ " " ^ $e'$ >>) e es ;;
let str_item_of_stmt = function | Def (v,e,_loc) -> <:str_item< let $lid:v$ = $expr_of_expr e$ ;; >> | Print (es,_loc) -> let es = List.map expr_of_expr es in <:str_item< print_endline $concat_exprs _loc es$ ;; >> ;;
Notice that the locations stored in the AST nodes are captured using the name _loc
, which is the location used implicitly for the quotations. As such, the resulting OCaml AST nodes can be associated to the source location that caused it.
Parser
Next, we also define a grammar that can be used to parse our minimal language and produce an AST. We define two entry points (expr
and stmt
) and we extend these enty points with a rule for each possible case.
let expr = Gram.Entry.mk "expr" ;; let stmt = Gram.Entry.mk "stmt" ;;
EXTEND Gram expr: [ [ v = LIDENT -> Var (v,_loc) | s = STRING -> String (s,_loc) ] ]; stmt: [ [ "def"; v = LIDENT; "="; e = expr -> Def (v,e,_loc) | "print"; es = LIST1 expr SEP "," -> Print (es,_loc) ] ]; END ;;
We then have to integrate these elements. The grammar should be used to parse the input; the resulting AST should be translated to an OCaml AST and this OCaml AST should be outputted.
There are three ways of doing this integration:
Flushing the initial grammar but keeping the rule structure and names
If we keep the rule structure, we can use the camlp4 program as a driver. However, we have to override the OCaml grammar at an appropriate point to install an entirely different language. To find this point, we will do some inspection of the grammar from the toplevel of the main entry points: for an implementation (.ml file) this is Syntax.implem
, for an interface (.mli file) this is Syntax.interf
, and for code entered at the toplevel, this is Syntax.top_phrase
. Interfaces are beyond the scope of this tutorial, so we will focus on the other two:
$ ocaml camlp4o.cma # open Camlp4.PreCast ;; # Gram.Entry.print Format.std_formatter Syntax.implem ;; implem: [ LEFTA [ "#"; a_LIDENT; opt_expr; semi | EOI | str_item; semi; SELF ] ] - : unit = () # Gram.Entry.print Format.std_formatter Syntax.top_phrase ;; top_phrase: [ LEFTA [ EOI | phrase ] ] - : unit = () # Gram.Entry.print Format.std_formatter Syntax.phrase ;; phrase: [ LEFTA [ "#"; a_LIDENT; opt_expr; semi | str_item; semi ] ] - : unit = () # Gram.Entry.print Format.std_formatter Syntax.str_item ;; str_item: [ "top" LEFTA [ "let"; "module"; a_UIDENT; module_binding0; "in"; expr | "let"; opt_rec; binding; "in"; expr | "let"; opt_rec; binding ...
What this basically means, is that implementation is a sequence of structure items and directives (terminated by optional semicolons), and that a toplevel phrase is one structure item or directive (or the end-of-input). While directives might still be relevant in the context of our language, structure items are OCaml-specific. We therefore choose to override the grammar at the level of str_item
. This influences both toplevel and implementation code.
Without functors
This is the most straightforward. We complete our code from above as follows:
Gram.Entry.clear Syntax.str_item ;; EXTEND Gram Syntax.str_item : [ [ s = stmt -> str_item_of_stmt s ] ]; END ;;
We first flush the existing definitions of str_item
and then install a new rule: we parse a statement and translate it to a structure item.
With functors (using Register.OCamlSyntaxExtension)
To do the same with functors instead of the PreCast
module, we have to start our definition differently. Instead of opening that module in the above code, we will write the following things.
First we import the camlp4 module types (such as Id
, Loc
, Ast
, Camlp4Syntax
etc.), and the camlp4 token constructors (such as LIDENT
, STRING
, etc).
open Camlp4.Sig ;;
Next we define an Id
module for our syntax (which camlp4 still views as a syntax extension):
module Id = struct let name = "MinimalLanguage" ;; let version = "0.1" ;; end ;;
We can then open our syntax extension module. It is a functor that defines a Camlp4Syntax
in terms of a given Camlp4Syntax
. We start by opening the given syntax:
module Minimal (Syntax : Camlp4Syntax) = struct open Syntax;;
Importing Syntax
makes a number of modules available, most importantly Ast
, Loc
and Gram
. That means we can include the definition of the translation and the parser unmodified.
After those definitions, we can override the str_item
entry point as before. Since we already opened the Syntax
module, we do not need to include that module name in the references to str_item
:
Gram.Entry.clear str_item ;; EXTEND Gram str_item : [ [ s = stmt -> str_item_of_stmt s ] ]; END ;;
Finally, we close our syntax extension module with:
include Syntax ;; end ;;
Since we include the original Camlp4Syntax
module at the end, our syntax extension functor defines the same syntax as the one that it took as an argument. Nonetheless, the grammar modification statements will be applied at functor application time. (We could have just included the Syntax
in the beginning of the definition of Minimal
, but then our definition of expr
would have overridden the existing expr
value and this would have clashed.)
Finally, we register this syntax extension and its Id
module. This occurs as a side-effect of a functor application, so we can do it as a local module definition that is never used.
let module M = Camlp4.Register.OCamlSyntaxExtension(Id)(Minimal) in () ;;
Usage
We can compile the syntax extension (defined either with or without functors) using camlp4of
:
ocamlc -c -I +camlp4 -pp camlp4of minimal.ml
Afterwards, it can be used with the camlp4o
binary. For example, if test.ml
contains the example program given at the beginning of the article, we can see the generated code:
$ camlp4o minimal.cmo test.ml let name = "world" let _ = print_endline ("Hello" ^ (" " ^ name))
Or compile and run it:
$ ocamlc -o test -pp "camlp4o minimal.cmo" test.ml && ./test Hello world
Together with camlp4o.cma
, it can be used to create a toplevel for our language:
$ ocaml camlp4o.cma minimal.cmo # def name = "world" ;; val name : string = "world" # print "Hello", name ;; Hello world - : unit = () # 1+1 ;; Parse error: illegal begin of top_phrase
Notice that semantic errors detected by OCaml will be reported with appropriate source locations, for example, a program with the single line:
print foo
will produce the error:
$ ocamlc -c -pp "camlp4o minimal.cmo" foo.ml File "foo.ml", line 1, characters 6-9: Unbound value foo
Make a new grammar using the same lexer and token type
open Camlp4.PreCast;; module Gram = MakeGram(Lexer);; ...
Then you have to write a main function for your parser, camlp4 cannot do it for you.
Make a new lexer and token type and then make a new grammar
...