
Parser Enhancement Requests - from/for Semantic Analyzer #7

@chiradip

Parser Enhancement Requests

Date: 2024-12-15
From: DB25 Semantic Analyzer Team
To: DB25 Parser Team
Status: Wish List (Our architecture does not depend on these)

Executive Summary

This document lists enhancements to the parser that would improve the semantic analyzer's capabilities and simplify our implementation. However, we have implemented workarounds for all these issues, so these are nice-to-have improvements, not blockers.


Priority 1: Missing SQL Data Types

Issue

The current ast::DataType enum is missing several standard SQL types that are commonly used in production databases.

Requested Enhancements

enum class DataType {
    // Existing types...

    // Missing numeric types
    Numeric,      // NUMERIC(p,s) - exact numeric with precision/scale
    Money,        // MONEY type for currency

    // Missing string types
    NChar,        // NCHAR for Unicode fixed-length
    NVarChar,     // NVARCHAR for Unicode variable-length
    Clob,         // CLOB for large text

    // Missing binary types
    ByteA,        // BYTEA for binary data (PostgreSQL style)
    VarBinary,    // VARBINARY(n)

    // Missing specialized types
    Uuid,         // UUID type
    Json,         // JSON data type
    Jsonb,        // Binary JSON (PostgreSQL)
    Xml,          // XML data type
    Inet,         // IP address type
    MacAddr,      // MAC address type

    // Missing array/composite
    Array,        // Typed arrays; element type carried separately (an enumerator cannot be a template)
    Record,       // Record/row type
    Table,        // Table type
};

Current Workaround

// We map unknown types to SemanticType::Unknown and infer from context
SemanticType infer_missing_type(ast::ASTNode* node) {
    auto text = node->primary_text;
    if (contains_ignore_case(text, "NUMERIC")) return SemanticType::Decimal;
    if (contains_ignore_case(text, "JSON")) return SemanticType::Json;
    if (contains_ignore_case(text, "UUID")) return SemanticType::Uuid;
    // ... etc
    return SemanticType::Unknown;
}
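The workaround above relies on a `contains_ignore_case` helper that is not shown in this issue. A minimal sketch of what it might look like, assuming it takes two string views and only needs ASCII case folding (the signature is a guess, not the analyzer's actual code):

```cpp
#include <algorithm>
#include <cctype>
#include <string_view>

// Hypothetical helper: true if `needle` occurs anywhere in `haystack`,
// ignoring ASCII case. An empty needle always matches.
inline bool contains_ignore_case(std::string_view haystack, std::string_view needle) {
    if (needle.empty()) return true;
    auto same_char = [](unsigned char a, unsigned char b) {
        return std::tolower(a) == std::tolower(b);
    };
    return std::search(haystack.begin(), haystack.end(),
                       needle.begin(), needle.end(), same_char) != haystack.end();
}
```

Because this is a substring test, "JSON" also matches "JSONB", so an inference chain built on it would need to check the more specific name first.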

Priority 2: AST Node Enhancements

Issue 1: Missing Parent Pointers

AST nodes don't have parent pointers, making upward traversal difficult.

Requested Enhancement

class ASTNode {
    ASTNode* parent;  // Add this
    // existing members...
};

Current Workaround

// We build a parent map manually
auto parent_map = ParserAdapter::build_parent_map(root);
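The body of `ParserAdapter::build_parent_map` is not part of this issue. As a rough sketch of the idea, assuming a node exposes its children as a vector of pointers (the `Node` struct below is a hypothetical stand-in, the real `ast::ASTNode` layout may differ):

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for ast::ASTNode.
struct Node {
    std::vector<Node*> children;
};

using ParentMap = std::unordered_map<const Node*, const Node*>;

// Walks the tree iteratively and records each node's parent.
// The root maps to nullptr.
inline ParentMap build_parent_map(const Node* root) {
    ParentMap parents;
    if (root == nullptr) return parents;
    parents[root] = nullptr;
    std::vector<const Node*> stack{root};
    while (!stack.empty()) {
        const Node* current = stack.back();
        stack.pop_back();
        for (const Node* child : current->children) {
            parents[child] = current;
            stack.push_back(child);
        }
    }
    return parents;
}
```

The map costs one hash lookup per upward step and must be rebuilt whenever the tree changes, which is the overhead a built-in parent pointer would remove.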

Issue 2: Missing Source Location

Not all AST nodes have accurate line/column information.

Requested Enhancement

struct SourceLocation {
    uint32_t line;
    uint32_t column;
    uint32_t offset;
    uint32_t length;
};

class ASTNode {
    SourceLocation location;  // Add complete location info
    // existing members...
};

Current Workaround

// We track approximate locations during traversal
class LocationTracker {
    std::unordered_map<ast::ASTNode*, SourceLocation> locations;
};
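For illustration, if a node carries at least a byte offset into the original SQL text (an assumption, this issue does not say which fields are currently reliable), the remaining fields can be derived from the source string. The `locate` function below is hypothetical:

```cpp
#include <cstdint>
#include <string_view>

struct SourceLocation {
    std::uint32_t line;
    std::uint32_t column;
    std::uint32_t offset;
    std::uint32_t length;
};

// Computes a 1-based line and column for a byte offset into `sql`.
// Offsets past the end of the text are clamped to the last position.
inline SourceLocation locate(std::string_view sql, std::uint32_t offset,
                             std::uint32_t length) {
    SourceLocation loc{1, 1, offset, length};
    for (std::uint32_t i = 0; i < offset && i < sql.size(); ++i) {
        if (sql[i] == '\n') {
            ++loc.line;
            loc.column = 1;
        } else {
            ++loc.column;
        }
    }
    return loc;
}
```

This scans from the start on every call; a parser would more likely record line starts once during lexing and look them up.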

Priority 3: Visitor Pattern Support

Issue

The AST doesn't support the visitor pattern, requiring switch statements for dispatch.

Requested Enhancement

class IASTVisitor {
public:
    virtual ~IASTVisitor() = default;
    virtual void visit(SelectStmt* stmt) = 0;
    virtual void visit(BinaryExpr* expr) = 0;
    // ... for all node types
};

class ASTNode {
public:
    virtual ~ASTNode() = default;
    virtual void accept(IASTVisitor* visitor) = 0;
    // existing members...
};

Current Workaround

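// We use switch with our adapter
template<typename Result>
Result dispatch_node(ast::ASTNode* node) {
    switch (node->node_type) {
        case ast::NodeType::SelectStmt:
            return handle_select(node);
        // ... all cases
        default:
            // Unhandled node types must still return a value
            // (assumes Result is default-constructible)
            return Result{};
    }
}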
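To make the request concrete, here is a self-contained sketch of the double dispatch it describes, with two node types and one visitor. The node bodies and the `KindRecorder` visitor are hypothetical and only show how `accept` selects the right `visit` overload without a switch:

```cpp
#include <string>

struct SelectStmt;
struct BinaryExpr;

class IASTVisitor {
public:
    virtual ~IASTVisitor() = default;
    virtual void visit(SelectStmt* stmt) = 0;
    virtual void visit(BinaryExpr* expr) = 0;
};

class ASTNode {
public:
    virtual ~ASTNode() = default;
    virtual void accept(IASTVisitor* visitor) = 0;
};

// Each concrete node forwards itself; the static type of `this`
// picks the matching visit() overload.
struct SelectStmt : ASTNode {
    void accept(IASTVisitor* visitor) override { visitor->visit(this); }
};

struct BinaryExpr : ASTNode {
    void accept(IASTVisitor* visitor) override { visitor->visit(this); }
};

// Example visitor: records the kind of the last node it visited.
class KindRecorder : public IASTVisitor {
public:
    std::string last;
    void visit(SelectStmt*) override { last = "SelectStmt"; }
    void visit(BinaryExpr*) override { last = "BinaryExpr"; }
};
```

Adding a node type then means adding one `visit` overload, and the compiler flags every visitor that has not implemented it, which a switch with a default case does not.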

Priority 4: Metadata Support

Issue

AST nodes can't carry semantic metadata (e.g., inferred types, resolved references).

Requested Enhancement

class ASTNode {
public:
    std::any metadata;  // Or a more structured approach

    template<typename T>
    void set_metadata(T&& data) {
        metadata = std::forward<T>(data);
    }

    // Returns nullptr if nothing is stored or the stored type is not T
    template<typename T>
    T* get_metadata() {
        return std::any_cast<T>(&metadata);
    }
};

Current Workaround

// We maintain external maps
class ValidationContext {
    std::unordered_map<const ast::ASTNode*, TypeInfo> type_cache_;
    std::unordered_map<const ast::ASTNode*, ResolvedRef> references_;
};
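As a sketch of how the requested `std::any` slot would behave in practice, the example below uses a hypothetical `AnnotatedNode` and a hypothetical `TypeInfo` payload (the analyzer's real `TypeInfo` is not shown in this issue). The pointer form of `std::any_cast` returns `nullptr` on a type mismatch rather than throwing:

```cpp
#include <any>
#include <string>
#include <utility>

// Hypothetical minimal node carrying type-erased metadata.
class AnnotatedNode {
public:
    template <typename T>
    void set_metadata(T&& data) {
        metadata_ = std::forward<T>(data);
    }

    // Returns nullptr when no metadata is set or the stored type is not T.
    template <typename T>
    T* get_metadata() {
        return std::any_cast<T>(&metadata_);
    }

private:
    std::any metadata_;
};

// Hypothetical payload for an inferred type.
struct TypeInfo {
    std::string name;
    bool nullable;
};
```

A single `std::any` holds one value at a time, so storing a resolved reference would overwrite an inferred type. Covering both external maps above would need either one combined payload struct or the "more structured approach" mentioned in the request.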

Priority 5: Parser API Enhancements

Issue 1: No Incremental Parsing

Can't parse partial SQL or reparse modified portions.

Requested Enhancement

class Parser {
    ParseResult parse_partial(std::string_view sql, size_t offset);
    ParseResult reparse_range(ASTNode* node, std::string_view new_text);
};

Issue 2: No Streaming Parser

Can't parse large SQL scripts efficiently.

Requested Enhancement

class StreamingParser {
    void parse_stream(std::istream& input,
                     std::function<void(ASTNode*)> callback);
};
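Until something like `StreamingParser` exists, a caller can approximate it by splitting the input into statements and handing each one to the existing parser. The sketch below is a deliberately simplified splitter. The names `for_each_statement` and `split_statements` are hypothetical, and it only understands single-quoted string literals, not comments or other quoting styles:

```cpp
#include <cctype>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads `input` one character at a time and invokes `on_statement` for each
// semicolon-terminated statement. Semicolons inside single-quoted string
// literals do not split. Leading whitespace of each statement is dropped.
inline void for_each_statement(
    std::istream& input,
    const std::function<void(const std::string&)>& on_statement) {
    std::string current;
    bool in_string = false;
    char c;
    while (input.get(c)) {
        if (c == '\'') in_string = !in_string;
        if (c == ';' && !in_string) {
            if (!current.empty()) on_statement(current);
            current.clear();
        } else if (!current.empty() || !std::isspace(static_cast<unsigned char>(c))) {
            current.push_back(c);
        }
    }
    if (!current.empty()) on_statement(current);  // trailing statement without ';'
}

// Convenience wrapper for in-memory text; collects the statements into a vector.
inline std::vector<std::string> split_statements(const std::string& sql) {
    std::istringstream input(sql);
    std::vector<std::string> out;
    for_each_statement(input, [&out](const std::string& s) { out.push_back(s); });
    return out;
}
```

Only one statement is held in memory at a time, but the splitter duplicates lexing rules the parser already knows, which is why a parser-side streaming API would be the more robust option.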

Issue 3: Limited Error Recovery

The parser stops at the first error instead of recovering and continuing.

Requested Enhancement

struct ParseResult {
    ASTNode* ast;
    std::vector<ParseError> errors;  // Multiple errors
    std::vector<ASTNode*> partial_trees;  // Recovered partial ASTs
};
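The coarsest form of this is statement-level recovery: a failure in one statement is recorded and parsing resumes with the next. The sketch below shows only that control flow. `RecoveryResult`, `parse_all`, and the `try_parse` callback are hypothetical, and `ParseError` is redefined with made-up fields because its real definition is not shown here:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical error record; the parser's real ParseError may differ.
struct ParseError {
    std::size_t statement_index;
    std::string message;
};

struct RecoveryResult {
    std::vector<std::size_t> parsed;  // indices of statements that parsed
    std::vector<ParseError> errors;   // one entry per failed statement
};

// Tries every statement and collects all failures instead of stopping at
// the first. `try_parse` returns an empty string on success or an error
// message on failure.
inline RecoveryResult parse_all(
    const std::vector<std::string>& statements,
    const std::function<std::string(const std::string&)>& try_parse) {
    RecoveryResult result;
    for (std::size_t i = 0; i < statements.size(); ++i) {
        std::string error = try_parse(statements[i]);
        if (error.empty()) {
            result.parsed.push_back(i);
        } else {
            result.errors.push_back({i, std::move(error)});
        }
    }
    return result;
}
```

Recovery inside a single statement, which is what `partial_trees` implies, needs synchronization points in the grammar itself and can only be done in the parser.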

Priority 6: Performance Optimizations

Issue

Parser allocates many small objects, causing fragmentation.

Requested Enhancement

class Parser {
    // Allow custom allocator injection
    template<typename Allocator>
    Parser(Allocator& allocator);

    // Or at least expose arena for reuse
    Arena& get_arena() { return arena_; }
};

Current Workaround

// We cache validation results per input string to avoid reparsing
std::unordered_map<std::string, ValidationResult> cache_;

Nice-to-Have Features

  1. AST Serialization/Deserialization

    std::string serialize_ast(ASTNode* root);
    ASTNode* deserialize_ast(std::string_view data);

  2. Comment Preservation

    class ASTNode {
        std::vector<Comment> attached_comments;
    };

  3. Formatting Information

    class ASTNode {
        FormattingHints format_hints;  // Original whitespace, etc.
    };

  4. Query Plan Hints

    class SelectStmt : public ASTNode {
        std::vector<Hint> optimizer_hints;  // /*+ INDEX(t1 idx1) */
    };

  5. Dialect Support

    enum class SQLDialect {
        ANSI, PostgreSQL, MySQL, SQLite, Oracle
    };

    class Parser {
    public:
        explicit Parser(SQLDialect dialect);
    };

Impact Analysis

If These Enhancements Are Implemented

| Enhancement       | Impact on Semantic Analyzer                             |
|-------------------|---------------------------------------------------------|
| Missing Types     | Remove type inference workarounds, better type checking |
| Parent Pointers   | Simplify scope resolution, remove parent map building   |
| Source Location   | Accurate error messages, better IDE integration         |
| Visitor Pattern   | Cleaner code, better extensibility                      |
| Metadata          | Remove external maps, better performance                |
| Incremental Parse | Enable real-time validation in IDEs                     |
| Error Recovery    | Better user experience, more complete analysis          |

Current State Without Enhancements

  • ✅ Semantic analyzer is fully functional
  • ✅ All features work with workarounds
  • ✅ Performance is acceptable
  • ✅ Architecture is clean despite limitations

Conclusion

While these enhancements would improve our implementation, we have successfully built a complete semantic analyzer that works with the current parser interface. Our adapter layer and workarounds handle all missing features effectively.

We do not require any parser changes to achieve A+ quality in the semantic analyzer.


Contact

For questions about these requests:

  • Team: DB25 Semantic Analyzer
  • Priority: Low (nice-to-have)
