Abstract:
Four-part series about creating an expressing evaluator.

Created by Peter Kankowski
Last changed
Filed under Interpreters and compilers

Share on social sitesReddit Digg Delicious Buzz Facebook Twitter

Adding variables to expression evaluator

In the previous part of expression evaluator series, we completed a simple compiler for arithmetic expressions. Now let's add support for variables and improve the code.

Lexer

The first improvement is separating lexer from parser. In the previous version, splitting the expression into tokens was intermixed with token processing. Now, there is a separate function, NextToken, which retrieves the next token. A token may be:

  • one of the following operators: + - * / = ( );
  • a floating-point number;
  • a name of variable.

How can you return all these kind of data types from NextToken? The function returns a char, which can be an operator or two special values (TOKEN_NUM and TOKEN_VAR). In the latter case, the calling function should look for the value in _number and _var variables.

Variables

The names and addresses of variables are stored in a hash table. A simple fixed-size (1024 elements) table and Bernstein hash function were used. Both provide good performance for this task.

The code to process variables is now very simple. When we encounter a variable name, we check if it is an assignment or usage of previously assigned variable. If the variable is assigned, we emit the code to calculate the expression and store the result in memory. If the value of variable is used in the expression, we just load it from memory:

if(_cur_token == TOKEN_VAR) {
    EVAL_CHAR* var_name = _op_start;
    if( NextToken() == '=') {
        double** stored_to = _var;
        NextToken();
        ParseSummands();
        *stored_to = _code.EmitStoreNumber(ExprCode::P_LEAVE);
    } else {
        if(NULL == *_var) {
            _op_start = var_name;
            ReportError(EEE_UNDEF_VAR);
        }
        _code.EmitLoadVariable(*_var);
    }
    if(negative)
        _code.EmitOperation(OP_CHANGE_SIGN);
}

The evaluator accepts C-style assignments like a = b = 2 or 3 + b = 2 (which means b = 2 and print 3 + b). We could implement Basic-style assignment, which allows assignment only to one variable in an expression, but this would require more complex code. We would have to push token back if the variable is the first token, but there is no assignment (e.g., a + 5). So, simpler C-style assignments were chosen.

As an exercise, you can implement pre-defined variables. For example, add pi variable for "π" number.

Conclusion

This code illustrates some techniques widely used in compilers. Most translators (including commercial C++ compilers) store variables in hash tables and separate lexer from parser. By studying this toy example, you can learn something about "real" compilers.

Download the expression evaluator, 43 KB

 

The previous part: Reverse Polish notation

Peter Kankowski
Peter Kankowski

About the author

Peter is the developer of Aba Search and Replace, a tool for replacing text in multiple files. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP.

Created by Peter Kankowski
Last changed

16 comments

Michael,
Awesome article, thanks for writing this series.

I've thoroughly enjoyed the education on machine code, which I'm applying to AIMGP.
Peter Kankowski,
Thank you for kind words :). I will continue the series.

And what is AIMGP?
Michael,
It stands for Automatic Induction of Machine-code Genetic Programming, it's a high performance technique for implementing Linear Genetic Programming.

It's an AI search algorithm where you generate machine code "organisms", test their fitness through execution of the raw machine code(on a set of test data), then perform crossover/mutation etc on the fitter organisms.

One of the more complex parts is learning machine code and generating it on the fly at run time.
ace,
One potentially interesting tool.

A parser/lexer/tree generator written with the goal to be better than all yacc-based-and-like tools, written in Java but with the C target possible. BCD license.

http://www.antlr.org/

I haven't tried it. The C runtime has around 10000 lines of C (!) which I consider a little too much for something small. But maybe it's good for experimenting or doing something big.
Peter Kankowski,
Expression evaluator in ANTLR
Anthony Akentiev,
Thank you, Petya! Can't wait to see upcoming articles. Well done. There were not so much new information to me, but it was still interesting.

p.s.
I love Red Dragon book (it is one of my fav. computer books) and i don't think that it is so "boring". I highly recomend it even as a starter.
Keny Ordaz,

Thank you, Petya! Do you think that implementing FCOMP could be easy? I mean for introducing > or < or ==, and to result 0 for false and 1 for true.

Thanks again!

ace,

Keny, your request sounds like you've got a homework assignment and now you expect somebody to do it for you.

Peter Kankowski,

IMHO, Keny asked an interesting question. The obvious solution is:

; st0 and st1 contain the numbers to compare.
; If they are equal, 1.0 is returned; if not, 0.0 is returned.
fcompp
fnstsw ax
shr ax, 14
and ax, 1
mov [tmp], ax
fild word[tmp]

However, FSTSW is slow (see Agner Fog's manuals). A better solution uses FCOMIP and FCMOVcc instructions (available on Pentium Pro and higher):

fcomip st0, st1
fstp st0
fld1
fldz
fcmove st0, st1
fstp st1

It's both shorter (less code bytes) and faster.

ace,

I understood the question as a request for the expression evaluator which evaluates relational operators.

Peter Kankowski,
The expression evaluator should generate this code snippet for == operation. For < and >, it should be modified (FCMOVB and FCMOVNBE instead of FCMOVE).
Darko Nikolic,

Thank you for this very good article series.

I have a questions considering the STRINGS or ARRAYS, because you have shown how to deal with numbers INTEGERS and DOUBLES, but ARRAYS of any type are also interesting.

How to allocate a large STRING or any ARRAY type variable in .data segment?

Should I allocate more pages from memory to hold this?

Can I dynamically allocate additional memory if needed during run-time?

Thank you very much for your answers!

Regards,

Darko

Peter Kankowski,
Please read the latest article about dynamic arrays.
Darko Nikolic,

My question was connected to my current work.

I am building a scripting engine and execution of native code from memory buffer sounds very good, considering the speed.

It looks like that Google Chrome V8 JavaScript engine is built on this idea.

So, that is what I was wondering while reading your articles.

Can this be applied to the scripting subsystem?

It looks like it can, and that it should be done in this manner.

Thank you again.

Regards,

Darko

Peter Kankowski,

Darko, sorry, I misunderstood your question. You can define operations on strings and arrays in the expression evaluator. For example, you can write functions for allocating, concatenating, and slicing the strings. The expression compiler will generate calls to these functions. String literals can be stored in a preallocated array, and string variables can be allocated at runtime.

In a dynamically typed scripting language, you will also have to check types of the variables, generate function prologues and epilogues, allocate local variables on stack, etc. It's more complex than a p-code virtual machine, but IMHO, it's possible. Few years ago, I wanted to write a simple version of such compiler myself.

Darko Nikolic,

Thank you, Peter, for your answer.

Regards,

Darko

Your name:


Comment: