.

Tags:

This is the second part of my weekly series on how to write a math expression evaluator in JavaScript. If you miss the first part, which was about lexical analysis, go and read it first and come back later.

In this installment, we will see the actual parsing process. Based on a sequence of tokens obtained from the lexer, we shall be able to match these tokens to the expression grammar which describes the syntax of the expression. At the end of the process, we want to produce a syntax tree. This entire procedure is commonly known as syntactic analysis.

An abstract syntax tree (AST) represents the syntactic structure of the expression. Consider the following expression:

x = -6 * 7

The associated syntax tree for the above expression should look like:

The above syntax tree matches our logic if we are supposed to evaluate the expression (if we handle every node in the tree using depth-first traversal): take 6, negate it, multiply the result with 7, put the result into x. Unless you don’t pass primary school math, your brain should magically gives -42 as the answer.

Of course, the question is how to create the tree given a list of tokens associated with the expression? There are multiple strategies to analyze the expression’s syntax. The one which I illustrate below is commonly known recursive descent parser. It may not be the fast one in some cases, but it is very illustrative and quite easy to trace should you want to follow the parsing logic.

Again, the following paragraphs explain and discuss various code fragment only (error checkings may be omitted). The full code is available for your pleasure from the TapDigit project.

The main entry point for the parsing looks like this. The lexer itself comes from the implementation of the lexical analyzer discussed in part 1.

function parse(expression) {
    var expr;
 
    lexer.reset(expression);
    expr = parseExpression();
 
    return {
        'Expression': expr
    };
}

From this, we go to the main parseExpression function which is surprisingly simple. This is because our syntax only implies a variable assignment as an expression. For other languages with more elaborate control flow (branching, loop, etc) or some form of domain-specific language (DSL), assignment may not be the only form of expression.

function parseExpression() {
    return parseAssignment();
}

For the next subsequent parseFoo variants, we need a function which can match an operator. If the incoming operator is the same as the expected one, then it returns true.

function matchOp(token, op) {
    return (typeof token !== 'undefined') &&
        token.type === T.Operator &&
        token.value === op;
}

An example form of assignment is x = 42. However, we also want to tackle the case where the expression is also as plain as 42 or a nested assignment such as x = y = 42. See if you can understand how the implementation of parseAssignment below handles all the three cases (hint: recursive is a possibility).

function parseAssignment() {
    var token, expr;
 
    expr = parseAdditive();
 
    if (typeof expr !== 'undefined' && expr.Identifier) {
        token = lexer.peek();
        if (matchOp(token, '=')) {
            lexer.next();
            return {
                'Assignment': {
                    name: expr,
                    value: parseAssignment()
                }
            };
        }
        return expr;
    }
 
    return expr;
}

The function parseAdditive processes both addition and subtraction, i.e. it creates a binary operator node. There will be two child nodes, the left and the right ones. They represent the two subexpression, further handled by parseMultiplicative, to be added or subtracted. Update: the previous version of the function was copied from a wrong implementation, this has been fixed since.

function parseAdditive() {
    var expr, token;
 
    expr = parseMultiplicative();
    token = lexer.peek();
    while (matchOp(token, '+') || matchOp(token, '-')) {
        token = lexer.next();
        expr = {
            'Binary': {
                operator: token.value,
                left: expr,
                right: parseMultiplicative()
            }
        }
        token = lexer.peek();
    };
    return expr;
}

The same logic follows for parseMultiplicative below. It does handle both multiplication and division.

function parseMultiplicative() {
    var expr, token;
 
    expr = parseUnary();
    token = lexer.peek();
    while (matchOp(token, '*') || matchOp(token, '/')) {
        token = lexer.next();
        expr = {
            'Binary': {
                operator: token.value,
                left: expr,
                right: parseUnary()
            }
        };
        token = lexer.peek();
    }
    return expr;
}

Before we go and check the details of parseUnary, you may wonder why parseAdditive is first called and then parseMultiplicative. This is done in order to satisfy the operator precedence requirement. Consider the expression 2 + 4 * 10 which actually evaluates to 42 (multiply 4 by 10, then add 2) rather than 60 (add 2 to 4, multiply by 10). This is only possible if the topmost node in syntax tree is binary operator + which has two child nodes, the left one is just the number 2 and the right node is actually another binary operator *. The latter holds two numbers as the corresponding child nodes, 4 and 10.

To handle a negation, like -42, we use the concept of unary operation. In the syntax tree, this is represented by a unary operator node and it has only one child node (hence the name). While negation is one form of unary operation, there is also a case such as +42 which we need to take into account. Thanks to the recursive nature, expression like ----42 or even -+-+42 can be handled without any problem as well. The code to handle the said unary operation is as simple as the following:

function parseUnary() {
    var token, expr;
 
    token = lexer.peek();
    if (matchOp(token, '-') || matchOp(token, '+')) {
        token = lexer.next();
        expr = parseUnary();
        return {
            'Unary': {
                operator: token.value,
                expression: expr
            }
        };
    }
 
    return parsePrimary();
}

Now here comes one of the most important function of all: parsePrimary. First of all, let’s see four possible forms of primary node:

  • an identifier (basically referring to a variable in this context), e.g. x
  • a number, e.g. 3.14159
  • a function call, e.g. sin(0)
  • another expression enclosed in brackets, e.g. (4 + 5)

Fortunately, deciding whether the incoming tokens will form one of the above possibilities is rather easy as we just need to examine the token type. There is only ambiguity between an identifier and a function call, which can be solved if we peek at the next token, i.e. whether it is an open bracket or not. Without further ado, here is the code:

function parsePrimary() {
    var token, expr;
 
    token = lexer.peek();
 
    if (token.type === T.Identifier) {
        token = lexer.next();
        if (matchOp(lexer.peek(), '(')) {
            return parseFunctionCall(token.value);
        } else {
            return {
                'Identifier': token.value
            };
        }
    }
 
    if (token.type === T.Number) {
        token = lexer.next();
        return {
            'Number': token.value
        };
    }
 
    if (matchOp(token, '(')) {
        lexer.next();
        expr = parseAssignment();
        token = lexer.next();
        if (!matchOp(token, ')')) {
            throw new SyntaxError('Expecting )');
        }
        return {
            'Expression': expr
        };
    }
 
    throw new SyntaxError('Parse error, can not process token ' + token.value);
}

Now the remaining part is parseFunctionCall. If we see an example of a function call like sin(0), it basically consists of a function name, open bracket, function argument, and close bracket. It is important to realize that there can be more than one arguments (foo(1, 2, 3)) or no argument at all (random()), depending on the function itself. For simplicity, we split the handling of function argument to parseArgumentList. Here are both functions for your pleasure:

function parseArgumentList() {
    var token, expr, args = [];
 
    while (true) {
        expr = parseExpression();
        if (typeof expr === 'undefined') {
            break;
        }
        args.push(expr);
        token = lexer.peek();
        if (!matchOp(token, ',')) {
            break;
        }
        lexer.next();
    }
 
    return args;
}
 
function parseFunctionCall(name) {
    var token, args = [];
 
    token = lexer.next();
    if (!matchOp(token, '(')) {
        throw new SyntaxError('Expecting ( in a function call "' + name + '"');
    }
 
    token = lexer.peek();
    if (!matchOp(token, ')')) {
        args = parseArgumentList();
    }
 
    token = lexer.next();
    if (!matchOp(token, ')')) {
        throw new SyntaxError('Expecting ) in a function call "' + name + '"');
    }
 
    return {
        'FunctionCall' : {
            'name': name,
            'args': args
        }
    };
}

Voila! That’s all our parser code. When combined properly into a functional object, it is just about 200 lines of code, supporting various math operations with proper precedences, brackets, variables, and function calls.

Live demo is available at tapdigit.googlecode.com/git/parser.htm (most common desktop and mobile browsers are supported).

For the last part next week, we will see the final missing puzzle necessary to be able to evaluate the syntax tree. Update: the final part (on building the interpreter) has been published!

  • http://dacresni.github.com dacresni

    you may be interseted in a parser tree graph . I haven’t had time to clean it up but go to the cleanup branch/tag of https://github.com/dacresni/parse-tree-grapher outputs to latech, I want to output to javascript and make that the display.

  • http://dacresni.github.com dacresni

    you may be interseted in a parser tree graph . I haven’t had time to clean it up but go to the cleanup branch/tag of https://github.com/dacresni/parse-tree-grapher outputs to latech, I want to output to javascript and make that the display.

  • Masklinn

    Since you’re parsing expressions (and pretty much only expressions), you should look into Pratt’s top-down operator parser.

    I’m currently using one to implement an evaluator for Python expressions, and FWIW here’s the implementation for parsing the “+” operator, including prefix and infix uses:

    {
    id: ‘+’,
    bp: 110,
    nud: function () {
    this.left = expression(130);
    return this;
    },
    led: fuction (left) {
    this.left = left;
    this.right = expression(this.bp);
    }
    }

    (well that’s not completely true, the implementation of “+” actually is only 2 function calls: most operators will precisely follow the pattern above changing the values for `id` and `bp` so it’s easy to create factory functions and avoid having to type the same thing 40 times in a row).

    If you’re interested, I recommend this treatment: http://effbot.org/zone/simple-top-down-parsing.htm the code is in Python but very easy to follow and transcribe to Javascript. An other option is http://javascript.crockford.com/tdop/tdop.html though it’s noticeably more complex (it parses pretty much all of javascript and handles scopes as well).

    • ariya

      While I played with and had to implemented different parser techniques in the past, I still believe that the recursive descent is the best for a tutorial (i.e. the main goal of this blog post). It is easy to follow and easy to debug!

      Having said that, experienced developer should look at Crockford’s implementation of top down parsing. It is beautiful and compact, and it’s still my most favorite parsing code so far.

  • Masklinn

    Since you’re parsing expressions (and pretty much only expressions), you should look into Pratt’s top-down operator parser.

    I’m currently using one to implement an evaluator for Python expressions, and FWIW here’s the implementation for parsing the “+” operator, including prefix and infix uses:

    {
    id: ‘+’,
    bp: 110,
    nud: function () {
    this.left = expression(130);
    return this;
    },
    led: fuction (left) {
    this.left = left;
    this.right = expression(this.bp);
    }
    }

    (well that’s not completely true, the implementation of “+” actually is only 2 function calls: most operators will precisely follow the pattern above changing the values for `id` and `bp` so it’s easy to create factory functions and avoid having to type the same thing 40 times in a row).

    If you’re interested, I recommend this treatment: http://effbot.org/zone/simple-top-down-parsing.htm the code is in Python but very easy to follow and transcribe to Javascript. An other option is http://javascript.crockford.com/tdop/tdop.html though it’s noticeably more complex (it parses pretty much all of javascript and handles scopes as well).

    • ariya

      While I played with and had to implemented different parser techniques in the past, I still believe that the recursive descent is the best for a tutorial (i.e. the main goal of this blog post). It is easy to follow and easy to debug!

      Having said that, experienced developer should look at Crockford’s implementation of top down parsing. It is beautiful and compact, and it’s still my most favorite parsing code so far.

  • Pingback: math evaluator in javascript: part 1 (the tokenizer) | don't code today

  • Pingback: math evaluator in javascript: part 1 (the tokenizer) | don't code today

  • http://www.kazlauskas.me/ nagisa

    I was checking out your tutorial, and noticed one problem:
    Binary:
    operator: -
    left:
    Number: 100
    right:
    Binary:
    operator: -
    left:
    Number: 10
    right:
    Number: 10This is AST you get by parsing 100 – 10 – 10. Now guess what’s the result.

    • http://www.kazlauskas.me/ nagisa

      Oh, forgot about Disqus terrible formatting…

  • http://www.kazlauskas.me/ nagisa

    I was checking out your tutorial, and noticed one problem:
    Binary:
    operator: -
    left:
    Number: 100
    right:
    Binary:
    operator: -
    left:
    Number: 10
    right:
    Number: 10This is AST you get by parsing 100 – 10 – 10. Now guess what’s the result.

    • http://www.kazlauskas.me/ nagisa

      Oh, forgot about Disqus terrible formatting…

  • Andrew Webb

    Hi nice article!

    I have Java and C++ implementations and downloads that also accomplish this using the shunting yard algorithm

    http://www.technical-recipes.com/2011/a-mathematical-expression-parser-in-java-and-cpp/

    Comments and suggestions welcome.