YaccConstructor


RNGLR

Module purpose

This generator creates GLR parser, corresponding to described grammar. The main their benefit is ability to produce all possible derivation trees and store them in compact form. As result, they can accept all strings, belonging to language, described with given grammar.

The generated GLR parser works according to the paper Elizabeth Scott, Adrian Johnstone, Right Nulled GLR Parsers, 2006.

Generator

Running

To run generator you must type:
YaccConstructor -i <input file> <conversions> -g "RNGLRGenerator <command line options>"

Where &lt;command line options&gt; is sequence of pair: (-option value). All of them are represented as string, separated by whitespaces. The meaning of options is as follows:

  • -module name - The name of generated module. Inserted in the beginning of generated file. default: RNGLR.Parse.
  • -token tokentype - For all tokens will be generated type union Token, each subtype A of what have signature A . default: tokens have no arguments and can't be used in translation.
  • -pos posType - type of Position. During translation it's useful for current nonTerminal to know it's start and end position in source file. Unlike fsyacc, working with already known fslex position, this generator supposed to work with any sequence of tokens. So, position type is unknown in advance. default: Microsoft.FSharp.Text.Lexing.Position.
  • -o output - destination of generated file. default: fileName from IL + ".fs"
  • -table (LA|LALR) - what tables to use. LALR have more ambiguities, but incomparably smaller. default: LALR
  • -fullpath (true|false) - For easier debugging some lines of resulted file are marked, from what file and line they have arised. If fullpath=true, then file name will be printed fully. It can be bad because of usual problems of absolute files. It's usual situation when source and generated files are located in one directory, causing to specify this option to false. default: false
  • -translate (true|false) - whether generate translation description in result file or not. If you are pleasured with only derivation tree, it equals to false. Otherwise, if it's supposed to perform some semantic action, set it to true value. default: true
  • -light (true|false) - whether print #light "off" directive in the beginning of generated file. Currently only 'true' option is supported. default: true

Constraints on a grammar

The grammar must not contain EBNF, meta-rules, Literals, embedded alternatives (only high-level are allowed). It's also recommended not to use nested rules sequences because of unknown performance effect. Following conversion can translate your grammar into appropriate form: ExpandMeta, ExpandEbnf, ReplaceLiterals, ExpandInnerAlt, ExpandBrackets.

On the other hand, such features like resolvers, l- and s- attributes are supported.

Epsilon-productions are allowed. Generation extracts all epsilon-trees from grammar (so in the result tree they doesn't figurate directly and just referenced by some label). For each nonterminal there is a set of epsilon-trees, it can produce.

Parser

Abstract Syntax Tree structure

Firstly, it isn't tree. In fact, it can be any (with hardly described limitations) connected digraph. It can contain cycles, some vertices can be used by several other vertices. If user doesn't want to have all possible derivation trees, it's usually a grammar defect. And it's not recommended to run translation on tree with cycles! It's not tested, but can lead to something like NullReference? exception (because of using not-evaluated value). There are ways to work with such problems, they will be described further.

Each node, if we discard optimization details, have several families of children. Family is some way to expand the nonterminal. It consists of array (also, if simplify) of children. Each children can be:

  • Another Node
  • int with value >= 0. Then it represents a number of token in input sequence
  • int with value < 0. It represents an epsilon-tree for nonterminal, what number equals to the value.
The entire tree can be either Node or epsilon-tree.

Building AST

See RNGLRApplication for example. Project references must include RNGLRCommon and RNGLRParser. Using any tool you have to obtain a sequence of tokens. You can create AST from this sequence using command

<parserModule>.buildAst tokens
, where <parserModule> - name of generated module.

Translation

Suppose you have AST named tree. Then you can perform semantic actions on it. This is allowed by command:
<parserModule>.translate args tree : <result_type>
, where args instantiate a record type (TranslateArguments) with following fields:
  • tokenToRange : Token -> Position * Position
    
    Function, allowing for each token to count it's position in source file. Positions of all non-terminals and epsilon-trees are derived from these values.
  • zeroPosition : Position
    
    But input sequence may be empty => there is no tokens => we don't know any position. In this case zeroPosition allows to translate resultin epsilon-tree, where both start and end positions for all nonterminals equals to zeroPosition.
  • clearAST : boolean
    
    Memory optimization feature. If it's true, then during translation initial tree will be destroyed. In can reduce memory usage during translation.
  • filterEpsilons
    
    If it is true, used epsilon-trees will produce only one tree for each. If you want to have all results (with all epsilon-trees values), you are to set it to false.

Useful Features

Suppose we have AST (named ast). Then we can perform following actions:
  • .defaultAstToDot tree 
    
    Print generated tree in dot format. Then it can be translated in visual format using GraphViz. After installing you can run, e.g.
  • dot  -Tsvg -O
    
    Result svg-file can be viewed in browser. All nonterminals with ambiguities are red-colored and contains '!' character for easier search with Ctrl-F.
  • tree.PrintAst()
    
    Immediately print tree in readable text format. If tree contains cycles, StackOverflowException occurs.
  • tree.collectWarnings tokenToRange
    
    tokenToRange - previously described function (Token -> Position * Position). It traverses AST and for each ambiguity prints:
    • Its range.
    • All rules, it can be derived with.
  • tree.ChooseSingleAst()
    
    Select one tree (kills all cycles and ambiguities). It's unknown, what exactly. Looks like something near longest match, but definitely isn't it in general.
  • tree.ChooseLongestMatch()
    
    Also, select one tree. But now it's guaranted that result corresponds the longest match.
  • tree.EliminateCycles()
    
    Delete cycles from tree. It allows to translate tree. Also used in generation to allow writing epsilon-tree structurally.

Error recovery

Algorithm

Parser has a mechanism for error recovery. The mechanism is based on a strategy of error token (more about).

In short, when error happens then error token is inserted in front of the error detection point. The parser will pop states from the parse stack until this token becomes valid, and then skip symbols from the input until an acceptable symbol is found.

Rule with error token in a grammar

You should add error token in grammar as simple rule:
stmt: firstRule 
    | secondRule
    | ...
    | error
By the way, the error token is always defined.

Info about errors

When you got the tree (named ast), you can learn about all errors which occur during parsing by command
ast.collectErrors tokenToRange
where tokenToRange - previously described function (Token -> Position * Position).

This function traverses AST and returns the array of triples:

  • Start position of unparsed exression.
  • End position of unparsed expression.
  • Array of skipped tokens.

Semantic for error node

You can add semantic for error node.

For instance:

stmt: res=someRule {}
| e=error {printfn "Error on %A token. Expected %A token(s)" e.errorOn e.expected}
Now error node has such fields as:
  • errorOn It's token in which error occurs
  • expected Parser was expecting one of these tokens. Example ariphmetic expression "1 2;" After "1" parser was expecting such symbols as "+", "-", "", "/" etc.
  • tokens Skipped tokens during error recovery.
  • recTokens Parser was looking for one of these tokens during recovery. Example: ariphmetic expression with rule:
    stmt -> error ;
    
    and "1 2 ;" as input. The recovery token in this case is ";".
Fork me on GitHub