A tiny parser in C++

Just before the end of the year, I would like to post here one small programming project on which I have been working over last month in my (more or less) free time. It is a Tiny Parser (TIPA) written in C++11 and here is the first version, still in alpha:

https://github.com/glipari/tipa

I designed this parser borrowing some of the concepts of the Boost::Spirit parser. However, unlike Spirit, my parser is simpler and almost entirely template-free. This means that it is easier to use, but also much less powerful than Spirit.

One thing I did not like about Spirit is its steep learning curve. Sometimes I was making a stupid mistake in declaring an object, but crawling through the many lines of the error messages provided by g++ was frustrating. Sometime my parser did not work but I couldn’t understand where was the problem. Maybe one of those BOOST_ADAPT macros? Or maybe by parser was not backtracking correctly? And then, I could never configure a decent error message for my parser. Not to forget the large amount of time needed to compile.

Instead, what I did like about spirit was its ability to write parser rules almost as one writes EBNF rules. I wanted that feature, and at the same time I wanted to avoid all the complexity.

So I wrote a library which provides this specific feature: to write a parser, you have to write your rules as a EBNF. Unfortunately, extracting the information from the parsed text is less automatic than Spirit: you have to write you “collecting” functions explicitely. However, this is not too difficult once you know how to do it. Maybe a little annoying, I know, but at least you get decent error messages when you do an error, and you can easily debug your code.

Here is an example of a simple calculator grammar:

    // These are the parsing rules
    rule expr, primary, term, 
	op_plus, op_minus, op_mult, op_div, r_int;
    
    // An expression is sequence of terms separated by + or -
    expr = term >> *(op_plus | op_minus);
    op_plus = rule('+') > term;    
    op_minus = rule('-') > term;

    // A term is a sequence of primaries, separated by * or /
    term = primary >> *(op_mult | op_div);
    op_mult = rule('*') > primary;
    op_div = rule('/') > primary;

    // A primary is either an integer or 
    // an expression within parenthesis
    primary = r_int | 
	rule('(') >> expr >> rule(')');

    // An integer is an integer!
    r_int = rule(tk_int);

Now, this is it. Notice that you can define and use a rule, and assign it a value later on. In this way, you can easily build recursive rules. (If you want a comparison with latest Spirit, see here: http://boost-spirit.com/home/2013/02/23/spirit-x3-on-github/).

Wait, this are only the rule we still have to parse a string. Here is how to do:

    // preparing the "context" of the parser
    parser_context pc;
    pc.set_stream(str);

    bool f = false;
    try {
	f = expr.parse(pc);
    } catch(parse_exc &e) {
	cout << "Parse exception!" << endl;
    }

Basically, you have to prepare an object of type “parser_context”, which will be passed along the parsing. This must be initialized with a stream object (in our code snippet, the str object). Then you take the “root” rule (in our case, the expr object), and call the parse() method, passing the context.

This parsing just returns true (if the expression was succesfully recognized by the parser) or false. If an error is found, it can also raise an exception, depending on the type of error (the error handling is still in a very preliminary status in the library).

Knowing if the expression is correct or not is indeed useful. However, most of the times we need to do something with the expression, for example computing the result, or simplifying it, or storing it somewhere. Therefore, how can we extract useful information during the parsing? Suppose you want to build a tree which represent our expression. Here we go: first, I define the classes that compose our expression tree,

class tree_node {
public:
    virtual int compute() = 0;
    virtual ~tree_node() {}
};

class op_node : public tree_node {
protected:
    shared_ptr<tree_node> left;
    shared_ptr<tree_node> right;
public:
    void set_left(shared_ptr<tree_node> l) {
	left = l;
    }
    void set_right(shared_ptr<tree_node> r) {
	right = r;
    }
};

class leaf_node : public tree_node {
    int value;
public:
    leaf_node(int v) : value(v) {}
    virtual int compute() { return value; }
};

#define OP_NODE_CLASS(xxx,sym)		 \
    class xxx##_node : public op_node {  \
    public:                              \
      virtual int compute() {            \
	int l = left->compute();         \
	int r = right->compute();        \
	return l sym r;                  \
      }                                  \
    }

OP_NODE_CLASS(plus,+);
OP_NODE_CLASS(minus,-);
OP_NODE_CLASS(mult,*);
OP_NODE_CLASS(div,/);

A leaf in this node is a simple integer. A non-leaf node is an operation between the left and right subtree. For example, if the node is a plus_node, then computing its value consists in computing the left and the right sub-trees first, and then summing the two. Hope it is clear!

Then I use an helper class which is used to build the tree incrementally.

class builder {
    stack< shared_ptr<tree_node> > st;
public:
    void make_leaf(parser_context &pc) {
	auto x = pc.collect_tokens();
        if (x.size() < 1) throw string("Error in collecting integer");
	int v = atoi(x[x.size()-1].second.c_str());
	auto node = make_shared<leaf_node>(v);
	st.push(node);
    } 

    template<class T>
    void make_op(parser_context &pc) {
	auto r = st.top(); st.pop();
	auto l = st.top(); st.pop();
	auto n = make_shared<T>();
	n->set_left(l);
	n->set_right(r);
	st.push(n);
    }
    
    int get_size() { return st.size(); }

    shared_ptr<tree_node> get_tree() {
	return st.top();
    }
};

It uses an internal stack where the partial results are stored. When an integer is found by the parser, we want to call the make_leaf() method which builds a leaf_node and push it into the stack. The integer is read from the parser_context() object using method pc.collect_tokens(). When an operation is found, the corresponding node is built using the make_op() template function, where template parameter T is the corresponding node operation.

Last thing we need to do is to call those methods at the right moment during parsing. Here is how to do it:

    builder b; 
    using namespace std::placeholders;

    r_int   [std::bind(&builder::make_leaf,           &b, _1)];
    op_plus [std::bind(&builder::make_op<plus_node>,  &b, _1)];
    op_minus[std::bind(&builder::make_op<minus_node>, &b, _1)];
    op_mult [std::bind(&builder::make_op<mult_node>,  &b, _1)];
    op_div  [std::bind(&builder::make_op<div_node>,   &b, _1)];

It’s not rocket science, don’t worry! I first create the builder object b. Then, I tell each rule which function must be called when the rule is successfully evaluated: but, since make_leaf() and make_op() are not simple functions, but class methods, I have to transform them into functions.

To do this I use the std::bind() library function. I need to specify an object (in this case it’s b), and bind it to the method. The standard library function std::bind() which takes the method, a pointer to the object, and I specify that the argument (of type parser_context) of the method becomes the first argument of the resulting function. It may look a little confusing at first, but I suggest you give a deeper look at the std::bind function description in the reference manual before trying to modify my code.

That’s it. You can find the code above in the library examples, as tipa/example/arithmetic.cpp

If you are curious and would like to give it a try, please download it from github and let me know what you think. The license is GPLv3. I am available to implement your requests for features if they are reasonably simple!

Happy new year!

2 thoughts on “A tiny parser in C++

Leave a comment