24 changes: 23 additions & 1 deletion doc/Index.md
@@ -3,7 +3,6 @@
[![Gem Version](https://badge.fury.io/rb/lrama.svg)](https://badge.fury.io/rb/lrama)
[![build](https://github.com/ruby/lrama/actions/workflows/test.yaml/badge.svg)](https://github.com/ruby/lrama/actions/workflows/test.yaml)


## Overview

Lrama is an LALR(1) parser generator written in Ruby. The first goal of this project is to provide an error-tolerant parser for CRuby with minimal changes to CRuby's parse.y file.
@@ -47,6 +46,29 @@ Enter the formula:
=> 9
```

## Documentation (Draft)

Chapters are split into individual files under `doc/chapters/` to make the structure easy to extend.

1. [Concepts](chapters/01-concepts.md)
2. [Examples](chapters/02-examples.md)
3. [Grammar Files](chapters/03-grammar-files.md)
4. [Parser Interface](chapters/04-parser-interface.md)
5. [Parser Algorithm](chapters/05-parser-algorithm.md)
6. [Error Recovery](chapters/06-error-recovery.md)
7. [Handling Context Dependencies](chapters/07-context-dependencies.md)
8. [Debugging](chapters/08-debugging.md)
9. [Invoking Lrama](chapters/09-invoking-lrama.md)
10. [Parsers in Other Languages](chapters/10-other-languages.md)
11. [History](chapters/11-history.md)
12. [Version Compatibility](chapters/12-version-compatibility.md)
13. [FAQ](chapters/13-faq.md)

## Development

1. [Compressed State Table](development/compressed_state_table/main.md)
2. [Profiling](development/profiling.md)

## Supported Ruby version

Lrama runs under BASERUBY when Ruby is built from source. Lrama therefore needs to support the BASERUBY version, currently 3.1, and later versions.
47 changes: 47 additions & 0 deletions doc/chapters/01-concepts.md
@@ -0,0 +1,47 @@
# Concepts

This chapter introduces the ideas behind Lrama and how it differs from GNU Bison.
Lrama is an LALR(1) parser generator written in Ruby. It was built to replace
Bison in the CRuby build while remaining compatible with Bison-style grammars.

## Lrama at a glance

- **LALR(1) parser generator**: Lrama produces C parsers from grammar files.
- **Bison-style grammar files**: Most Bison directives are accepted, but there
are compatibility constraints (see below).
- **Error tolerant parsing**: Lrama can generate parsers that attempt recovery
using a subset of the algorithm described in *Repairing Syntax Errors in LR
Parsers*.
- **Ruby-focused**: Lrama is written in Ruby and is used in the CRuby build
process.

## Compatibility assumptions

Lrama is not a full Bison reimplementation. It intentionally assumes the
following Bison configuration when reading a grammar file:

- `b4_locations_if` is always true (location tracking is enabled).
- `b4_pure_if` is always true (pure parser).
- `b4_pull_if` is always false (no pull parser interface).
- `b4_lac_if` is always false (no LAC).

These assumptions simplify the code generation path and reflect how CRuby uses
a Bison-compatible parser.

## Inputs and outputs

A typical Lrama run takes a `.y` grammar file and produces:

- A parser implementation in C (default `y.tab.c`, or the file passed by `-o`).
- A header file (`y.tab.h` by default) when `-d` or `-H` is provided.
- Optional reports (`--report` / `--report-file`).
- Optional syntax diagram output (`--diagram`).

## Workflow stages

1. Write a grammar file (`.y`) using Bison-compatible syntax.
2. Run Lrama to generate the parser C code.
3. Compile the generated C code with the rest of your project.

For worked examples, see the [Examples](02-examples.md) section.
41 changes: 41 additions & 0 deletions doc/chapters/02-examples.md
@@ -0,0 +1,41 @@
# Examples

This chapter mirrors the structure of the Bison manual examples, but focuses on
what is present in the Lrama repository today.

## Calculator example (sample/calc.y)

The [`sample/calc.y`](../../sample/calc.y) grammar is the canonical example
for running Lrama.

```shell
$ lrama -d sample/calc.y -o calc.c
$ gcc -Wall calc.c -o calc
$ ./calc
```

The grammar demonstrates:

- Declaring tokens and precedence.
- Attaching semantic actions in C.
- Generating a header file with `-d`.

## Minimal parser example (sample/parse.y)

[`sample/parse.y`](../../sample/parse.y) is a smaller grammar intended to be
used by the build instructions and smoke tests.

```shell
$ lrama -d sample/parse.y
```

## Additional grammars

The `sample/` directory includes additional grammars that cover different
syntax styles:

- [`sample/json.y`](../../sample/json.y)
- [`sample/sql.y`](../../sample/sql.y)

These are good starting points when verifying compatibility or experimenting
with new directives.
113 changes: 113 additions & 0 deletions doc/chapters/03-grammar-files.md
@@ -0,0 +1,113 @@
# Grammar Files

Lrama reads Bison-style grammar files. Each grammar file has four sections in
order:

1. **Prologue**: C code copied verbatim into the generated parser.
2. **Declarations**: Bison-style directives such as `%token` and `%start`.
3. **Grammar rules**: The productions and semantic actions.
4. **Epilogue**: C code appended to the end of the generated parser.

A minimal grammar looks like this:

```yacc
%token INTEGER
%%
input: INTEGER '\n';
%%
```

## Symbols

- **Terminals** are tokens returned by the lexer.
- **Nonterminals** are syntactic groupings defined by rules.

Lrama accepts the common `%token`, `%type`, `%left`, `%right`, and
`%precedence` declarations in the declarations section.

## Rules and actions

Grammar rules use the standard Bison syntax. Semantic actions are C code blocks
that run when a rule is reduced.

```yacc
expr:
    expr '+' expr { $$ = $1 + $3; }
  | INTEGER       { $$ = $1; }
  ;
```
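
To make the action `$$ = $1 + $3` more concrete, here is a minimal sketch of what a reduction does to the parser's value stack. The `ValueStack` type and the `push` and `reduce_add` helpers are hypothetical names invented for this illustration; the parser Lrama generates manages its own stacks and does not expose these functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified value stack (illustration only). */
typedef struct {
    int values[16];
    size_t top;                /* number of values currently on the stack */
} ValueStack;

void push(ValueStack *s, int v) {
    assert(s->top < 16);
    s->values[s->top++] = v;
}

/* Reduce by `expr: expr '+' expr { $$ = $1 + $3; }`. */
void reduce_add(ValueStack *s) {
    assert(s->top >= 3);
    const int *rhs = &s->values[s->top - 3];  /* rhs[0] is $1, rhs[2] is $3 */
    int result = rhs[0] + rhs[2];             /* $$ = $1 + $3 */
    s->top -= 3;                              /* pop the three right-hand side values */
    push(s, result);                          /* push $$ for the new `expr` */
}
```

The slot for `'+'` holds a value too, but the action never reads `$2`, so its content does not matter.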

## Parameterized rules

Lrama extends Bison-style rules with parameterization. A nonterminal definition
may accept other symbols as parameters, allowing you to reuse rule templates.
Parameterized rules are defined with `%rule` and invoked like a nonterminal.

```yacc
%rule option(X)
  : /* empty */
  | X
  ;

program:
    option(statement)
  ;
```

When Lrama expands a parameterized rule, it creates a concrete nonterminal
whose name encodes the parameters. The example above expands to a rule named
`option_statement`.

### Parameterized rules in the standard library

Lrama ships a standard library of reusable parameterized rules in
[`lib/lrama/grammar/stdlib.y`](../../lib/lrama/grammar/stdlib.y). Common
patterns include:

- `option(X)`: optional symbol.
- `list(X)`: zero or more repetitions.
- `nonempty_list(X)`: one or more repetitions.
- `separated_list(separator, X)`: separated list with optional empty case.
- `separated_nonempty_list(separator, X)`: separated list with at least one
element.
- `delimited(opening, X, closing)`: wrap a symbol with delimiters.

You can reference these directly by including the standard library in your
grammar or copy them into your own grammar file.

### Semantic values and locations

Parameterized rules support the same semantic action syntax as ordinary rules.
If you add actions to a parameterized rule, the generated nonterminal keeps the
action and location references intact. When you call a parameterized rule, the
resulting nonterminal can be used like any other symbol in subsequent rules.

## Inlining

The `%inline` directive replaces all references to a symbol with its
definition. It is useful for eliminating extra nonterminals, removing
shift/reduce conflicts, or keeping small helper rules from polluting the symbol
list.

```yacc
%rule %inline opt_newline
  : /* empty */
  | '\n'
  ;

lines:
    lines opt_newline line
  | line
  ;
```

An inline rule does not create a standalone nonterminal in the output. Instead,
its productions are substituted wherever the inline symbol is referenced. This
is why `%inline` is often paired with parameterized rules (for example,
`%rule %inline ioption(X)` in the standard library) to build reusable templates
without growing the symbol table.

## Error recovery

Use `error` tokens in rules and enable recovery with `-e` when generating the
parser. For guidance, see the [Error Recovery](06-error-recovery.md) chapter.
35 changes: 35 additions & 0 deletions doc/chapters/04-parser-interface.md
@@ -0,0 +1,35 @@
# Parser Interface

Lrama generates a C parser that follows the same API style as Bison’s default
C interface. The entry point is `yyparse`, which calls `yylex` to obtain tokens
from the lexer and uses `yyerror` for error reporting.

## Required functions

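- `int yylex(YYSTYPE *lvalp, YYLTYPE *llocp)` returns the next token and stores
its semantic value and location through the two pointers.
- `int yyparse(void)` drives the parser.
- `void yyerror(YYLTYPE *llocp, const char *message)` reports syntax errors.

These are the shapes Bison uses for a pure parser with locations enabled, which
is the configuration Lrama assumes.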

The signatures may vary if you configure `%parse-param` or `%lex-param`
arguments in your grammar.
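
As an illustration of the lexer side of this interface, the sketch below hand-writes a tiny `yylex` in the pure-parser style, with one extra argument of the kind `%lex-param` would add. It is not taken from the Lrama samples: the `YYSTYPE` and `YYLTYPE` definitions, the token numbers, and the `Lexer` struct are stand-ins assumed for this example, where a real project would take the first three from the generated header.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-ins for definitions a generated header would provide. */
typedef union { int ival; } YYSTYPE;
typedef struct { int first_line, first_column, last_line, last_column; } YYLTYPE;
enum { END_OF_INPUT = 0, INTEGER = 258 };

/* Hypothetical lexer state, passed explicitly instead of living in globals. */
typedef struct {
    const char *text;  /* NUL-terminated input */
    size_t pos;        /* index of the next unread character */
    int column;        /* 1-based column of the next unread character */
} Lexer;

int yylex(YYSTYPE *lvalp, YYLTYPE *llocp, Lexer *lx) {
    assert(lvalp != NULL && llocp != NULL && lx != NULL);

    /* Skip blanks; newlines are left alone so they can be returned as tokens. */
    while (lx->text[lx->pos] == ' ' || lx->text[lx->pos] == '\t') {
        lx->pos++;
        lx->column++;
    }

    unsigned char c = (unsigned char)lx->text[lx->pos];
    llocp->first_line = llocp->last_line = 1;  /* this sketch never advances the line */
    llocp->first_column = lx->column;

    if (c == '\0') {
        llocp->last_column = lx->column;
        return END_OF_INPUT;
    }

    if (isdigit(c)) {
        int value = 0;
        while (isdigit((unsigned char)lx->text[lx->pos])) {
            value = value * 10 + (lx->text[lx->pos] - '0');
            lx->pos++;
            lx->column++;
        }
        lvalp->ival = value;                   /* semantic value of INTEGER */
        llocp->last_column = lx->column - 1;   /* inclusive end column */
        return INTEGER;
    }

    /* Any other character is returned as a literal token such as '+'. */
    lx->pos++;
    lx->column++;
    llocp->last_column = llocp->first_column;
    return c;
}
```

Passing the cursor through a `Lexer *` keeps the lexer free of global state, which matches the pure-parser model described in this chapter.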

## Location tracking

Location tracking is always enabled in Lrama’s compatibility model. Use `@n`
for the location of a right-hand side symbol and `@$` for the location of the
left-hand side. Define a location type via `%define api.location.type` or by
customizing the generated code.
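
The relationship between `@$` and `@1` ... `@n` can be sketched as follows. This mirrors the default behaviour documented for Bison (`@$` spans from the start of the first right-hand side symbol to the end of the last one), and assumes Lrama's generated code follows the same default. The `Location` type and `default_location` function are hypothetical names used only for this illustration.

```c
#include <assert.h>

typedef struct { int first_line, first_column, last_line, last_column; } Location;

/*
 * Compute @$ for a rule with n right-hand side symbols.
 * rhs[0] is the location of the symbol just before the rule,
 * rhs[1] .. rhs[n] are @1 .. @n.
 */
Location default_location(const Location *rhs, int n) {
    assert(rhs != 0 && n >= 0);
    Location result;
    if (n > 0) {
        /* Span from the start of @1 to the end of @n. */
        result.first_line   = rhs[1].first_line;
        result.first_column = rhs[1].first_column;
        result.last_line    = rhs[n].last_line;
        result.last_column  = rhs[n].last_column;
    } else {
        /* Empty rule: an empty span at the end of the preceding symbol. */
        result.first_line   = result.last_line   = rhs[0].last_line;
        result.first_column = result.last_column = rhs[0].last_column;
    }
    return result;
}
```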

## Header generation

Use `-d` or `-H` to emit a header file containing token definitions and shared
structures:

```shell
$ lrama -d sample/parse.y
```

## Pure parser assumptions

Lrama assumes a pure parser (`b4_pure_if` is always true). This means semantic
value and location information are passed explicitly rather than using globals.
32 changes: 32 additions & 0 deletions doc/chapters/05-parser-algorithm.md
@@ -0,0 +1,32 @@
# Parser Algorithm

Lrama produces LALR(1) parsers. The generated parser uses the standard LR
algorithm with shift/reduce and reduce/reduce conflict resolution.

## Conflicts and precedence

Use `%left`, `%right`, and `%precedence` declarations to resolve
shift/reduce conflicts. Lrama reports conflicts in the `--report` output and
with `-v` (alias for `--report=state`).
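
The way precedence declarations settle a shift/reduce conflict can be sketched as a small decision function. The rules below are the ones documented for Bison (compare the precedence of the rule with that of the lookahead token, then fall back to associativity), including Bison's `%nonassoc`; the sketch assumes Lrama applies the same rules. The enum and function names are hypothetical and do not correspond to Lrama internals.

```c
#include <assert.h>

typedef enum { ASSOC_LEFT, ASSOC_RIGHT, ASSOC_NONASSOC, ASSOC_PRECEDENCE } Assoc;
typedef enum { ACTION_SHIFT, ACTION_REDUCE, ACTION_ERROR, ACTION_UNRESOLVED } Action;

/*
 * Decide a shift/reduce conflict between a rule and a lookahead token.
 * A precedence of 0 means "no precedence declared"; larger binds tighter.
 */
Action resolve_shift_reduce(int rule_prec, int token_prec, Assoc token_assoc) {
    assert(rule_prec >= 0 && token_prec >= 0);
    if (rule_prec == 0 || token_prec == 0)
        return ACTION_UNRESOLVED;              /* stays a reported conflict */
    if (rule_prec > token_prec)
        return ACTION_REDUCE;                  /* the rule binds tighter */
    if (rule_prec < token_prec)
        return ACTION_SHIFT;                   /* the token binds tighter */
    switch (token_assoc) {                     /* same level: use associativity */
    case ASSOC_LEFT:     return ACTION_REDUCE; /* %left groups to the left */
    case ASSOC_RIGHT:    return ACTION_SHIFT;  /* %right groups to the right */
    case ASSOC_NONASSOC: return ACTION_ERROR;  /* %nonassoc forbids chaining */
    default:             return ACTION_UNRESOLVED; /* %precedence has no associativity */
    }
}
```

For example, with `%left '+'` declared before `%left '*'`, parsing `1 + 2 * 3` reaches a conflict between reducing `expr '+' expr` and shifting `'*'`; the token has the higher precedence, so the parser shifts.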

## Reports and diagnostics

Lrama can emit detailed state and conflict reports during parser generation.
Common report options include:

- `--report=state`: state machine summary (also `-v`).
- `--report=counterexamples`: generate conflict counterexamples.
- `--report=all`: include all reports.

You can write the report to a file with `--report-file`.

```shell
$ lrama -v --report-file=parser.report sample/parse.y
```

## Error tolerant parsing

When `-e` is supplied, Lrama enables its error recovery extensions. This uses a
subset of the algorithm described in *Repairing Syntax Errors in LR Parsers*.
Refer to [Error Recovery](06-error-recovery.md) for guidance on structuring
rules.
29 changes: 29 additions & 0 deletions doc/chapters/06-error-recovery.md
@@ -0,0 +1,29 @@
# Error Recovery

Lrama supports error tolerant parsing inspired by the algorithm described in
*Repairing Syntax Errors in LR Parsers*.

## Enabling recovery

Pass `-e` when generating the parser to enable recovery support.

```shell
$ lrama -e sample/parse.y
```

## Writing recovery rules

Use the special `error` token in grammar rules to specify recovery points. A
common pattern is to skip to a statement terminator or newline.

```yacc
statement:
    expr ';'
  | error ';' { /* discard the rest of the statement */ }
  ;
```
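
The effect of a rule such as `error ';'` on the token stream can be pictured with the simplified sketch below: after an error, tokens are discarded until the synchronizing `';'` is found, and parsing resumes after it. This shows only the token-skipping part of classic yacc-style recovery. It does not model the state stack, and it is not the repair algorithm enabled by `-e`. The `skip_to_sync` function is a hypothetical helper written for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Starting at index `start`, discard tokens until `sync` is found.
 * Returns the index of the first token after `sync`, or `n` if the
 * input ends before a synchronizing token appears.
 */
size_t skip_to_sync(const int *tokens, size_t n, size_t start, int sync) {
    assert(tokens != NULL && start <= n);
    size_t i = start;
    while (i < n && tokens[i] != sync) {
        i++;                       /* token is thrown away */
    }
    return i < n ? i + 1 : n;      /* resume just past the terminator */
}
```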

## Handling recovery in actions

Make sure semantic actions can cope with partially parsed input. Keep actions
small and defensively check inputs for null values when necessary.
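
One way to keep actions defensive is to route tree construction through helpers that tolerate missing operands. The sketch below assumes a hypothetical `Node` type and helper functions; none of these names come from Lrama or its samples.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical AST node used only for this illustration. */
typedef struct Node {
    char op;                    /* '\0' for a leaf */
    int value;                  /* meaningful for leaves only */
    struct Node *left, *right;
} Node;

Node *new_leaf(int value) {
    Node *n = calloc(1, sizeof *n);
    if (n != NULL) n->value = value;
    return n;
}

void free_tree(Node *n) {
    if (n == NULL) return;
    free_tree(n->left);
    free_tree(n->right);
    free(n);
}

/*
 * Build a binary node. If either operand is NULL (for example because it
 * came from a recovered `error` production), free the other operand and
 * return NULL instead of dereferencing a missing child.
 */
Node *new_binary(char op, Node *left, Node *right) {
    assert(op != '\0');
    if (left == NULL || right == NULL) {
        free_tree(left);
        free_tree(right);
        return NULL;
    }
    Node *n = calloc(1, sizeof *n);
    if (n == NULL) {
        free_tree(left);
        free_tree(right);
        return NULL;
    }
    n->op = op;
    n->left = left;
    n->right = right;
    return n;
}
```

An action can then be written as `{ $$ = new_binary('+', $1, $3); }` and a `NULL` result propagates upward without crashing.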
23 changes: 23 additions & 0 deletions doc/chapters/07-context-dependencies.md
@@ -0,0 +1,23 @@
# Handling Context Dependencies

Some grammars are difficult to express with pure context-free rules.
In these cases, the typical approach is to make the lexer or semantic actions
context aware.

## Token-level context

Emit different tokens depending on parser state. For example, you can track
whether you are inside a type declaration and return a distinct token for
identifiers in that context.
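
A minimal sketch of this idea, assuming a flag that semantic actions (or the lexer itself) set when a type declaration starts and clear when it ends. The token numbers, the `LexContext` struct, and `identifier_token` are hypothetical names chosen for this example.

```c
#include <assert.h>

/* Hypothetical token numbers; a real grammar gets these from the generated header. */
enum { IDENTIFIER = 258, TYPE_NAME = 259 };

/* Hypothetical context shared between the parser's actions and the lexer. */
typedef struct {
    int in_type_declaration;   /* nonzero while inside a type declaration */
} LexContext;

/* Choose the token kind for an identifier based on the current context. */
int identifier_token(const LexContext *ctx) {
    assert(ctx != 0);
    return ctx->in_type_declaration ? TYPE_NAME : IDENTIFIER;
}
```

The grammar can then use `TYPE_NAME` and `IDENTIFIER` as separate terminals, which removes the ambiguity from the context-free rules.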

## Semantic predicates

Lrama does not provide GLR parsing or Bison's GLR semantic predicates
(`%?{ ... }`). Instead, keep the necessary state in regular semantic actions
and have the lexer return distinct tokens.

## Parameterized rules

Parameterized rules can help express repeated patterns without introducing
ambiguity. Use them to factor context-specific constructs while keeping the
grammar readable. See the [Grammar Files](03-grammar-files.md) chapter.
32 changes: 32 additions & 0 deletions doc/chapters/08-debugging.md
@@ -0,0 +1,32 @@
# Debugging

Lrama offers both generation-time and runtime diagnostics.

## Generator traces

Use `--trace` to print internal generation traces. Useful values are:

- `automaton`: print state transitions.
- `rules`: print grammar rules.
- `actions`: print rules with semantic actions.
- `time`: report generation time.
- `all`: enable all traces.

```shell
$ lrama --trace=automaton,rules sample/parse.y
```

## Reports

`--report` produces structured reports about states, conflicts, and unused
rules/terminals. See [Parser Algorithm](05-parser-algorithm.md) for details.

## Syntax diagrams

Use `--diagram` to emit an HTML diagram of the grammar rules.

```shell
$ lrama --diagram=diagram.html sample/calc.y
```

The repository includes a sample output in [`sample/diagram.html`](../../sample/diagram.html).