Joanju - Why Proparse?

We've heard through the grapevine (i.e. the internet) that a common question about the use of Proparse for Prolint is: Why? Why not just use grep/sed/awk/perl… or even 4GL? What do we need a parser for?

Bold points are from Jurjen Dijkstra; I will expand on each of them.

Nodes are organized in a tree, somewhat similar to a Treeview or an XML file. There are several functions to navigate from one node to its parent, its siblings, and its children.
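To make the navigation idea concrete, here is a toy sketch in Python. It is not Proparse's API: the `Node` class and the method names `add`, `first_child`, and `next_sibling` are invented for illustration, and only show what parent, sibling, and child navigation looks like on a tree.

```python
class Node:
    """Toy tree node (hypothetical, not a Proparse class)."""

    def __init__(self, type_, text=""):
        self.type = type_      # node type, e.g. "RUN"
        self.text = text       # text as written in the source, e.g. "run"
        self.parent = None
        self.children = []

    def add(self, child):
        """Attach child as the last child of this node and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def first_child(self):
        return self.children[0] if self.children else None

    def next_sibling(self):
        """Return the node that follows this one under the same parent."""
        if self.parent is None:
            return None
        siblings = self.parent.children
        i = next(k for k, s in enumerate(siblings) if s is self)
        return siblings[i + 1] if i + 1 < len(siblings) else None


# A small tree for: run test.  followed by a procedure block
root = Node("Program_root")
run = root.add(Node("RUN", "run"))
run.add(Node("FILENAME", "test"))
run.add(Node("PERIOD", "."))
proc = root.add(Node("PROCEDURE", "procedure"))

print(run.first_child().type)    # FILENAME
print(run.next_sibling().type)   # PROCEDURE
print(proc.parent.type)          # Program_root
```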

Think about this: How easy is it to find where a PROCEDURE block begins and ends when using a parser? Trivial. How hard is it using perl? Your head should start to hurt if you think about that long enough.

Each node may have additional attributes to store convenient information about that node.

The parser "marks up" the nodes in the tree with attributes. For example, given a node which is a record name in a FIND statement, does the record name represent a database schema table, a work-table, or a temp-table? Subsequent tools like Prolint add more and more information to the tree until we have everything that we need to know in order to do the analysis tasks at hand.

Node types are never abbreviated, even if keywords in the source are abbreviated.

Actually, the node types are integers, but they can be worked with either as integers or as strings. From the 4GL, we usually work with strings. If you want to find all DEFINE nodes, you just find all nodes with type equal to "DEFINE". You don't have to worry if it's actually "def" in the source code.

Whitespace and comments do not affect the tree.

Whitespace and comments have no impact on the semantics of the compile unit, and they are not nodes in the tree. (There are some neat tricks that the parser plays though, so that given a node, you can examine the whitespace or comments that come before it.) The key here is, if you want to find all instances of DEFINE SHARED VARIABLE, you don't have to worry about those words being separated by whitespace or comments.

Preprocessor references and include files are fully expanded.

If you need to find all instances of DEFINE SHARED VARIABLE, something like "def {&shared} var" would certainly leave you in the lurch. You would have to run your "perl" tools against COMPILE..PREPROCESS output, but then, once you found DEFINE SHARED VARIABLE, how would you know what file or line it came from? That gets into examining the COMPILE..LIST, which is its own can of worms. The parser makes it easy.

Let's have a look at the parser's output for the following trivial program:

run test.
procedure test:
  disp 1 + 2.
end procedure.

And here's the output from the parser, which gives an idea of the shape of the tree. Each of the following lines represents a node in the tree:

Program_root    
    RUN    run
        FILENAME    test
        PERIOD    .
    PROCEDURE    procedure
        ID    test
        LEXCOLON    :
        Code_block    
            DISPLAY    disp
                Form_item    
                    PLUS    +
                        NUMBER    1
                        NUMBER    2
                PERIOD    .
        END    end
            PROCEDURE    procedure
        PERIOD    .
    Program_tail

You can see where a procedure begins and ends. You can see where a display statement begins and ends. You can easily see what the two operands of a "plus" operator are. The first token on each line is the node's type (string value). For some nodes, we also see another attribute - "node text", which is the text from the source code. Nodes also carry other attributes, ranging from a few basics like filename and line number to more specialized ones like "store type" (database, temp-table, or work-table).
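As a rough illustration of how little work those questions take once the tree exists, the sketch below rebuilds a toy tree from the listing above and queries it by node type. It is plain Python, not the Proparse API: `Node`, `parse_listing`, `walk`, and `find` are names made up for this example, and the four-space indent is simply how the listing happens to be printed here.

```python
LISTING = """\
Program_root
    RUN    run
        FILENAME    test
        PERIOD    .
    PROCEDURE    procedure
        ID    test
        LEXCOLON    :
        Code_block
            DISPLAY    disp
                Form_item
                    PLUS    +
                        NUMBER    1
                        NUMBER    2
                PERIOD    .
        END    end
            PROCEDURE    procedure
        PERIOD    .
    Program_tail
"""


class Node:
    """Toy tree node (hypothetical, not a Proparse class)."""

    def __init__(self, type_, text=""):
        self.type, self.text = type_, text
        self.parent, self.children = None, []


def parse_listing(listing, indent=4):
    """Rebuild a tree from the indented listing; depth = indentation / indent."""
    stack = []  # stack[d] is the most recent node seen at depth d
    for line in listing.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip())) // indent
        parts = line.split()
        node = Node(parts[0], parts[1] if len(parts) > 1 else "")
        del stack[depth:]          # drop nodes that are not ancestors
        if stack:
            node.parent = stack[-1]
            stack[-1].children.append(node)
        stack.append(node)
    return stack[0]


def walk(node):
    """Yield node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root, type_):
    """Return all nodes of the given type, whatever text the source used."""
    return [n for n in walk(root) if n.type == type_]


root = parse_listing(LISTING)

# The type is DISPLAY even though the source says "disp".
display = find(root, "DISPLAY")[0]
print(display.type, display.text)             # DISPLAY disp

# The operands of "+" are simply its children.
plus = find(root, "PLUS")[0]
print([c.text for c in plus.children])        # ['1', '2']

# The procedure block is one branch: it starts at the PROCEDURE node
# under the root and ends at that node's last child.
proc = [n for n in find(root, "PROCEDURE") if n.parent is root][0]
print([c.type for c in proc.children])
# ['ID', 'LEXCOLON', 'Code_block', 'END', 'PERIOD']
```

Note that the query for PROCEDURE also returns the node under END, which is why the sketch filters on the parent.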

Certainly there are lint-like activities which could be done by using regular expressions, but a quick peek at some Prolint rules should give some ideas why a parser is necessary.
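To put the earlier DEFINE SHARED VARIABLE point in concrete terms, here is a small Python sketch of the kind of regular expression one might try first. The pattern and the sample lines are my own illustration, not taken from Prolint; the samples use ordinary 4GL keyword abbreviations, a comment, and a preprocessor reference.

```python
import re

# A first-attempt pattern: the three keywords separated by whitespace.
NAIVE = re.compile(r"\bdefine\s+shared\s+variable\b", re.IGNORECASE)

samples = [
    "DEFINE SHARED VARIABLE x AS INTEGER.",               # spelled out
    "def shared var x as int.",                           # abbreviated keywords
    "define /* legacy */ shared variable x as integer.",  # comment in between
    "def {&shared} var x as int.",                        # preprocessor reference
]

matches = [bool(NAIVE.search(s)) for s in samples]
print(matches)   # [True, False, False, False]
```

The pattern could be stretched to cover abbreviations and comments, but the last sample cannot be decided from the text at all: what `{&shared}` expands to is only known after preprocessing.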

Let's start with this example: We want to find all instances in the code like this:

"The sales rep for " + cust-name + " is " + salesrep + "."

because for translation (i18n) purposes, it should instead look like this:

SUBSTITUTE("The sales rep for &1 is &2.", cust-name, salesrep)

The parser builds a tree out of the code, and for expressions, the tree is structured in a fashion which makes the expression easy to evaluate. In the case of operators like "+", the "+" node is actually the parent to the two operands. One of those operands may actually be another "+" node, etc.

Prolint uses a parser query to find all "+" nodes. It then examines the children of that "+" node to determine whether there is more than one translatable quoted string in the expression, and whether there is another "+" node in the expression. That's it - it really is that simple. Now, imagine trying to get grep, or even a perl script, to figure that out!
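The check just described can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Prolint's actual rule code: the `Node` class, the node type names `PLUS`, `QSTRING`, and `FIELD`, and the helper names are all invented here, and "translatable" is approximated as "a quoted string without the :U (untranslatable) attribute".

```python
class Node:
    """Toy tree node (hypothetical, not a Proparse class)."""

    def __init__(self, type_, text="", children=()):
        self.type, self.text = type_, text
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def plus(left, right):
    return Node("PLUS", "+", [left, right])


def qstring(text):
    return Node("QSTRING", text)


def field(name):
    return Node("FIELD", name)


def walk(node):
    """Yield node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def is_translatable(node):
    # Assumption: a quoted string counts as translatable unless marked :U.
    return node.type == "QSTRING" and not node.text.upper().endswith(":U")


def concat_warnings(root):
    """Return the topmost "+" nodes that glue translatable strings together."""
    hits = []
    for n in walk(root):
        if n.type != "PLUS":
            continue
        if n.parent is not None and n.parent.type == "PLUS":
            continue  # report each chain of "+" once, at its top
        below = [d for d in walk(n) if d is not n]
        strings = sum(1 for d in below if is_translatable(d))
        has_nested_plus = any(d.type == "PLUS" for d in below)
        if strings > 1 and has_nested_plus:
            hits.append(n)
    return hits


# "The sales rep for " + cust-name + " is " + salesrep + "."
expr = plus(plus(plus(plus(qstring('"The sales rep for "'), field("cust-name")),
                      qstring('" is "')), field("salesrep")), qstring('"."'))

# "Total: " + amount   (one string, no nested "+")
simple = plus(qstring('"Total: "'), field("amount"))

print(len(concat_warnings(expr)))     # 1
print(len(concat_warnings(simple)))   # 0
```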

Here's another one:
x eq 12.
That's the entire statement. It is an equality check - it doesn't actually do an assignment, and it's probably a bug. The entire statement is just an expression - it's not actually a proper statement like a DISPLAY or an ASSIGN statement. The parser makes structured branches out of every statement, and in order for it to make a consistently structured branch for expression statements like this, it puts a special node type at the top of the branch. Those nodes are easy for Prolint to find, again with a simple query.